Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi
Dec 28, 2015


Sibyl Nash
Page 1: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next Generation DNA Sequencing

IPM-NUS Workshop on Computational Biology

Mehdi Sadeghi

Page 2: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing methodologies: 1977

• Maxam-Gilbert
– base modification by general and specific chemicals.
– depurination or depyrimidination.
– single-strand excision.
– not amenable to automation

• Sanger
– DNA replication.
– substitution of substrate with chain-terminator chemical.
– more efficient
– automation?

Page 3: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing: Chemistry

Page 4: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing: Chemistry

template + polymerase + dCTP, dTTP, dGTP, dATP + ddATP, ddGTP, ddTTP, ddCTP

→ extension → electrophoresis

[Figure: double-stranded template drawn as base pairs A•T G•C A•T T•A C•G T•A G•C G•C A•T G•C T•A T•A C•G T•A G•C A•T]

Page 5: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Capillary electrophoresis

Page 6: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

ABI 370s-series

Page 7: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing: Computation

Page 8: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing

Page 9: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA Sequencing

Goal:

Find the complete sequence of A, C, G, T’s in DNA

Challenge:

There is no machine that takes long DNA as an input, and gives the complete sequence as output

Can only sequence ~500 letters at a time

Page 10: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Genome Sequencing

[Figure: the genome is cut into short fragments of DNA; each fragment is read as a short DNA sequence, and the overlapping reads are assembled into the sequenced genome.]

Short DNA sequences (reads): ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…

Assembled from overlapping reads: ACGTGGTAATGGCGTATACACCCTTAGGCCATA

Sequenced genome: ACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACCTCT...

Page 11: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sequencing strategies

Whole genome

Page 12: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

DNA sequencing – vectors

DNA → shake → DNA fragments

DNA fragment + vector = fragment inserted at a known location (restriction site)

Vector: circular genome (bacterium, plasmid)

Page 13: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Different types of vectors

VECTOR                                   Size of insert (bp)    Notes
Plasmid                                  2,000-10,000           Can control the size
Cosmid                                   40,000
BAC (Bacterial Artificial Chromosome)    70,000-300,000
YAC (Yeast Artificial Chromosome)        > 300,000              Not used much recently

Page 14: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sanger sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags

Page 15: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sanger Sequencing

• Advantages
– Long reads (~750 bp)
– Suitable for small projects

• Disadvantages
– Low throughput
– Expensive

Page 16: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Method to sequence longer regions

cut many times at random (Shotgun)

genomic segment

Get one or two reads from each segment

~500 bp ~500 bp

Page 17: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Reconstructing the Sequence (Fragment Assembly)

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region

reads

Page 18: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Definition of Coverage

Length of genomic segment: L

Number of reads: n

Length of each read: l

Definition: Coverage C = n l / L

How much coverage is enough?
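The definition above can be checked with a few lines of Python. This is a minimal sketch; the function names and the example numbers (a 1 Mbp segment, 500 bp reads) are illustrative assumptions, not values from the slides.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / genome_len

def reads_for_coverage(target_c: float, read_len: int, genome_len: int) -> int:
    """Smallest number of reads n such that n * l / L >= target_c."""
    return math.ceil(target_c * genome_len / read_len)

# Hypothetical example: 14,000 reads of 500 bp over a 1 Mbp segment.
print(coverage(14_000, 500, 1_000_000))        # 7.0
print(reads_for_coverage(7, 500, 1_000_000))   # 14000
```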


Page 19: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Assembly: How Much DNA?

Input                                       Output
Low coverage: a few pieces to assemble      many contigs, many gaps
High coverage: many pieces to assemble      a few contigs, a few gaps

(Lander and Waterman, 1988)
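The trade-off between coverage and the number of contigs can be made concrete with the Lander-Waterman estimate. The sketch below assumes the standard form of that result, which is not written out on the slide: with n reads of length l, coverage C = n l / L, and a minimum detectable overlap of T bases, the expected number of contigs is about n·exp(-C·σ) with σ = 1 - T/l, and the expected fraction of uncovered bases is about exp(-C). Function and parameter names are illustrative.

```python
import math

def expected_contigs(n_reads, read_len, genome_len, min_overlap=0):
    """Lander-Waterman estimate of the number of contigs (assumed model)."""
    c = n_reads * read_len / genome_len      # coverage C = n l / L
    sigma = 1 - min_overlap / read_len       # fraction of a read usable for overlap detection
    return n_reads * math.exp(-c * sigma)

def expected_uncovered_fraction(c):
    """Poisson approximation: probability that a given base is in no read."""
    return math.exp(-c)

# Hypothetical 1 Mbp segment sequenced with 500 bp reads at increasing coverage.
for c in (1, 2, 4, 8):
    n = c * 1_000_000 // 500
    print(c, round(expected_contigs(n, 500, 1_000_000), 1),
          round(expected_uncovered_fraction(c), 4))
```

Under this model the number of contigs first rises with the number of reads and then falls steeply, which matches the low-coverage versus high-coverage contrast above.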

Page 20: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Challenges with Fragment Assembly

• Sequencing errors

~1-2% of bases are wrong

• Repeats

false overlap due to repeat

Page 21: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Repeats

Bacterial genomes: 5%
Mammals: 50%

Repeat types:

• Low-Complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons
– SINE (Short Interspersed Nuclear Elements), e.g., ALU: ~300 bp long, 10^6 copies
– LINE (Long Interspersed Nuclear Elements): ~4000 bp long, 200,000 copies
– LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV
• Gene Families: genes duplicate & then diverge (paralogs)
• Recent duplications: ~100,000 bp long, very similar copies

Page 22: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Strategies for whole-genome sequencing

1. Hierarchical – Clone-by-clone
i. Break genome into many long pieces
ii. Map each long piece onto the genome
iii. Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walking
i. Break genome into many long pieces
ii. Start sequencing each piece with shotgun
iii. Construct map as you go
Example: Rice genome

3. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Page 23: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Whole-Genome Shotgun Sequencing

Page 24: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Whole Genome Shotgun Sequencing

cut many times at random

genome

forward-reverse paired reads at a known distance

plasmids (2 – 10 Kbp)

cosmids (40 Kbp)

~500 bp ~500 bp

Page 25: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Assembly

Cut DNA to larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)

[Figure: contig 1 and contig 2 linked by 2 Kbp mates and 15 Kbp mates; each mate is a ~500 bp read, so the two mates are separated by ~(length - 1,000) bp]

• Resolving repeats
• Better assembly of contigs, gap length estimation

Page 26: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sequencing of Human Genome: Public Consortium

• Many years of hard work
• More than 20,000 BAC clones
• Each containing about 100 kb fragment
• Together provided a tiling path through each human chromosome
• Amplification in bacterial culture
• Isolation, select pieces about 2-3 kb
• Subcloned into plasmid vectors, amplification, isolation
• Recreate contigs
• Refinement, gap closure, sequence quality improvement
• (less than 1 error / 40,000 bases)
• BAC based approaches toward WGS

Page 27: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sanger Sequencing

[Timeline axis: 1980, 1990, 2000]

1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)

1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)

2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)

2007: Global Ocean Sampling, ~3,000 organisms, 7 Gbp (Venter et al.)

Page 28: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Sequencing the Human Genome

[Figure: log10(price) versus year, 2000 to 2010]

2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days
2012: 100$, <24 hrs?

Page 29: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

2nd Generation: Pyrosequencing

• Sequencing by synthesis

• Advantages:
– Accurate
– Parallel processing
– Easily automated
– Eliminates the need for labeled primers and nucleotides
– No need for gel electrophoresis

Page 30: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing

• Basic idea:
– Visible light is generated and is proportional to the number of incorporated nucleotides
– 1 pmol DNA = 6*10^11 ATP = 6*10^9 photons at 560 nm

[Figure labels: DNA Polymerase I from E. coli; pyrophosphate; luciferase (from fireflies) oxidizes luciferin and generates light]
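The "light proportional to the number of incorporated nucleotides" idea can be illustrated with a toy flowgram. This is a simplified sketch under stated assumptions: nucleotides are dispensed in a fixed cyclic order (the order "TACG" is an arbitrary choice here), the signal is exactly the homopolymer run length with no noise, and the input string is the strand being synthesized, so no complementing is done. All names are hypothetical.

```python
def flowgram(synthesized: str, flow_order: str = "TACG", n_cycles: int = 4):
    """Return a list of (nucleotide, signal) pairs, one per nucleotide flow.

    signal = number of bases incorporated during that flow, i.e. the length
    of the homopolymer run at the current position (0 if the base does not match).
    """
    pos = 0
    flows = []
    for _ in range(n_cycles):
        for nuc in flow_order:
            run = 0
            while pos < len(synthesized) and synthesized[pos] == nuc:
                run += 1
                pos += 1
            flows.append((nuc, run))
    return flows

def call_bases(flows):
    """Reconstruct the sequence by repeating each nucleotide 'signal' times."""
    return "".join(nuc * signal for nuc, signal in flows)

flows = flowgram("TTACGGGA")
print(flows[:4])          # first cycle: [('T', 2), ('A', 1), ('C', 1), ('G', 3)]
print(call_bases(flows))  # TTACGGGA
```

In a real instrument the signal is an analog light intensity, which is why long homopolymer runs become hard to call.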

Page 31: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing

• 1st Method
– Solid Phase
• Immobilized DNA
• 3 enzymes
• Wash step to remove nucleotides after each addition

Page 32: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing

• 2nd Method
– Liquid Phase
• 3 enzymes + apyrase (nucleotide degradation enzyme)
– Eliminates need for washing step
• In the well of a microtiter plate:
– primed DNA template
– 4 enzymes
• Nucleotides are added stepwise
• Nucleotide-degrading enzyme degrades previous nucleotides

Page 33: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing

Page 34: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing Results:

Page 35: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Pyrosequencing

Disadvantages

• Smaller sequences
• Nonlinear light response after more than 5-6 identical nucleotides

Page 36: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.


Next Generation Sequencing: Why Now?

Page 37: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
Page 38: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

High Parallelism is Achieved in Polony Sequencing

[Figure: Polony versus Sanger]

Page 39: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next Generation Sequencing

• DNA is fragmented

• Adaptors ligated to fragments

• Several possible protocols yield array of PCR colonies.
– Emulsion PCR
– Bridge PCR

• Enzymatic extension with fluorescently tagged nucleotides.

• Cyclic readout by imaging the array.

Page 40: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next Generation Sequencing

• 454 Life Sciences/Roche
– Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
– Published 1 million bases of Neanderthal DNA in 2006
– May 2007 published complete genome of James Watson (3.2 billion bases, ~20x coverage)

• Solexa/Illumina
– 10 GB per machine/week
– May 2008 published complete genomes for 3 HapMap subjects (14x coverage)

• ABI SOLiD
– 20 GB per machine/week

Page 41: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

“Paradigm Shift”

• Standard ABI “Sanger” sequencing
– 96 samples/day
– Read length ~750 bp
– Total = ~70,000 bases of sequence data

• 454 was the game changer!
– ~400,000 different templates (reads)/day
– Read length ~250 bp
– Total = 100,000,000 bases of sequence data!!!

Page 42: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Solexa ups the Game

• Solexa (Illumina GA)
– 60,000,000 different sequence templates (yes, that is 60 million reads)
– 36 bp read length
– 4 billion bases of DNA per run (3 days)

Page 43: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

• Each system works differently, but they are all based on similar principles:
– Shear target DNA into small pieces
– Bind individual DNA molecules to a solid surface
– Amplify each molecule into a cluster
– Copy one base at a time and detect different signals for A, C, T, & G bases
– Requires very precise high-resolution imaging of tiny features (charge-coupled device (CCD))

Page 44: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

454

• First high-throughput DNA sequencer, commercially available in 2004
• Now produces ~500 MB of reads, 500 bp each
• Run of 8 samples in 10 hours, so can do multiple runs/week
• Uses pyrosequencing, beads, and a microtiter plate
• Low error rate, but insert/delete problems with homopolymers (stretches of a single base)

Page 45: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
Page 46: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Illumina Genome Analyzer

• Originally developed by Solexa, now a subsidiary of Illumina.
• Commercially available in 2006
• Now produces 8-12 million reads per sample of 36 bp length = 10 GB/week.
• Run takes 3 days for 7 samples.
• Low error rate, mostly base changes, few indels

Page 47: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
Page 48: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Call Sequence

Page 49: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

ABI-SOLiD

• First commercially available in late 2007
• Currently capable of producing 20 GB of data per run (week)
• Most users generate 6 GB/run
• Reads ~30 bp long
• Uses unique sequence-by-ligation method
• “color-space” data
• Very low error rate

Page 50: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
Page 51: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Comparison of existing methods

Page 52: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

454 vs Solexa

454:
• Read length: 400 bp
• Number of reads: 400,000
• Per-base cost greater
• de novo assembly, metagenomics

Solexa:
• Read length: 40 bp
• Number of reads: millions
• Per-base cost cheaper
• Ideal for applications requiring short reads

Page 53: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Applications

• “If you build it, they will come.”
• An explosion of scientific innovation!
• Every new technology enables new applications, which are not directly foreseen by the original developers of the tech.

• Cheap access to high-volume sequencing becomes a data collection method for many different types of experimental applications

Page 54: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

• Ancient DNA
• DNA mixtures from diverse ecosystems, metagenomics
• Resequencing previously published reference strains
• Identification of all mutations in an organism
• Expand the number of available genomes
• Comparative studies
• Deciphering cell’s transcripts at sequence level without knowledge of the genome sequence
• Sequencing extremely large genomes, crop plants
• Detection of cancer specific alleles avoiding traditional cloning
• ChIP-seq: protein-DNA interactions
• Epigenomics
• Detecting ncRNA
• Human genetic variation: SNP, CNV (diseases)

Page 55: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Usage of sequencing data

• Transcriptome (RNA) sequencing
• Differential expression
• Alternative splicing

• Complete/targeted genome (DNA) resequencing

• Polymorphism and mutation discovery

Page 56: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

De Novo sequencing

• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments (32 bp reads)
– All-vs-all comparison grows quadratically with the number of reads, which is prohibitive at this scale
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run

Page 57: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Assembly

Page 58: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Resequencing (mutation discovery/genotyping)

• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms – looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences – both tandem and dispersed repeats

Page 59: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Read length and pairing

• Short reads are problematic, because short sequences do not map uniquely to the genome.

• Solution #1: Get longer reads.
• Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG

Page 60: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

RNA Sequencing

• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
– Can replace gene expression microarrays
• 25% more sensitive
• Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
– one experiment found that 34% of human transcripts were not from known genes
• Sultan et al, Science. 2008 Aug 15;321(5891):956-60.

Page 61: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

More information from RNA

• Can capture true alternative splicing information
– Sequence of splice-junctions
• One study found 4,096 previously unknown splice junctions in 3,106 human genes
– Different transcription start and end points for RNA molecules
• Allelic variation (SNPs)
• Small RNAs

Page 62: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Metagenomics

• Survey/discovery of all the species present in an environmental or medical sample
• “Human Microbiome”
– disease vs. healthy microbe populations in mouth, intestines, skin, reproductive tract, etc.

• Complete multiple genome sequencing

• Complete multi-species transcript profiling (metabolic reconstruction)

• Deep sampling of genetic variation in microbial populations (frequency of drug resistant, toxin producing, etc.)

Page 63: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Informatics is the Bottleneck

• Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it

• Customized analysis / Bioinformatics consulting is needed for every project

Page 64: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Bioinformatics Challenges

• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesign of algorithms to work in a parallel environment
– Very large text files (~10 million lines long)
– Impossible memory usage and execution time

Page 65: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Future Directions

• Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.

• Complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.

• Data storage and analysis bottleneck
• Data security/privacy issues

Page 66: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Overview: Whole genome shotgun sequencing

[Figure: a genomic segment is cut into fragments, which are read as short DNA sequences and assembled.]

Short DNA sequences: ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…

Assembled sequence: ACGTGGTAATGGCGTATACACCCTTAGGCCATA

Page 67: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

• Genomes
• Transcriptomes
• Metagenomes

• De Novo Assembly
• Template Based Assembly

Page 68: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

De Novo sequencing

• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments (32 bp)
– All-vs-all comparison grows quadratically with the number of reads, which is prohibitive at this scale
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run
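The fragment count on this slide follows directly from the coverage definition C = n l / L, and the cost of comparing every read against every other read follows from counting pairs. A small sketch of that arithmetic (function names are illustrative):

```python
def reads_needed(coverage: int, genome_len: int, read_len: int) -> int:
    """n = C * L / l, using integer arithmetic."""
    return coverage * genome_len // read_len

def all_vs_all_pairs(n_reads: int) -> int:
    """Number of unordered read pairs in a naive all-vs-all overlap search."""
    return n_reads * (n_reads - 1) // 2

n = reads_needed(coverage=8, genome_len=3_000_000_000, read_len=32)
print(n)                    # 750000000
print(all_vs_all_pairs(n))  # roughly 2.8e17 pairwise comparisons
```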

Page 69: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Genome Sequencing

• Assembly Algorithms
– Shotgun sequencing assembly problem

• Find the shortest common superstring of a set of sequences.

• Given strings {s1, s2, …} find the shortest string T such that every si is a substring of T.

• This is NP-hard.

Page 70: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Greedy Algorithm

• Nodes are fragments

• Edges mean that overlaps exist.

• Weights are the sizes of the overlaps found after calculating pairwise alignments of all fragments.

Page 71: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Greedy Algorithm

• Edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e match the first t bases of its head g.

• Hamiltonian path: a path that goes through every vertex exactly once

Page 72: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Greedy Algorithm

• Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.

• The algorithm is a “greedy” attempt at computing the heaviest path. The basic idea is to continuously add the heaviest available edge.
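The greedy idea can be sketched in a few lines: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is a minimal illustration under simplifying assumptions (exact overlaps only, no sequencing errors, no reverse complements, a non-empty input list); it is a heuristic and is not guaranteed to return the shortest common superstring. The function names are hypothetical.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length t of the longest suffix of a that equals a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    """Merge the heaviest-overlap pair until one string remains."""
    frags = sorted(set(fragments))
    # Fragments contained in another fragment add nothing to the superstring.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Heaviest available edge (a -> b) in the overlap graph.
        t, a, b = max(((overlap(a, b), a, b) for a, b in permutations(frags, 2)),
                      key=lambda cand: cand[0])
        merged = a + b[t:]
        # Remove a, b, and anything else now contained in the merged string.
        frags = [f for f in frags if f not in merged] + [merged]
    return frags[0]

reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA",
         "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]
print(greedy_superstring(reads))
```

The six reads used here are the complete short sequences from the earlier genome sequencing slide, and on this input the greedy merge recovers the assembled sequence shown there.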

Page 73: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Genome Sequencing

• Assembly Algorithms
• Overlap-layout-consensus
– An assembler builds the graph
– Output is a set of nonintersecting simple paths, each path being a contig.

Page 74: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Overlap-layout-consensus

• Overlap-layout-consensus method for assembly.
– Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap.
– Traverse the graph to find unambiguous paths which form contigs.

Page 75: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Overlap graph for a bacterial genome.  The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right).  The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Overlap-layout-consensus

Page 76: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next-generation sequencing

• Lower cost / base pair

• Very short fragment lengths (25-75bps)

• High error rate

• Inherent ability to do paired-end (mate-pair) sequencing.

Page 77: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next-generation sequencing

• Challenging to assemble data.
• Short fragment length = very small overlap, therefore many false overlaps

• Sequenced up to 100x coverage, increase in data size.

• Large number of reads + short overlap + higher error rate make traditional overlap - layout - consensus approach impractical.

Page 78: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Current approaches

• Euler / De Bruijn approach.

• Introduced as an alternative to the overlap-layout-consensus approach in capillary sequencing.

• More suited for short read assembly.

Page 79: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Genome Sequencing

• Assembly Algorithms
• Eulerian path
– Eulerian path: a path that visits every edge of a graph exactly once
– Breaks reads into overlapping n-mers.
– Each n-mer is an edge: its source is the (n-1)-prefix and its destination is the (n-1)-suffix of the n-mer.
– Basic problem is to find a path that uses all the edges.
– Finding an Eulerian path is more efficient than finding a Hamiltonian path.

Page 80: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Eulerian Circuits and Paths

Eulerian Circuit – visits each edge in a graph exactly once, and ends at the same vertex in which it started.

[Figure: graph on vertices a, b, c, d, e, f]

a-d-b-f-e-d-f-c-b-a is an Eulerian cycle in this particular graph

Eulerian Path – visits each edge in a graph exactly once.

[Figure: graph on vertices a, b, c, d, e, f, g, h, i, j]

a-b-c-d-e-f-g-c-h-f-i-j is an Eulerian trail in this particular graph
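For directed graphs, which is the case that matters for de Bruijn assembly, an Eulerian path can be found in time linear in the number of edges. The sketch below uses Hierholzer's algorithm, a standard technique that the slides do not name; the small example graph is made up for illustration, since the figures from this slide are not available.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return the vertex sequence of an Eulerian path in a directed multigraph,
    or None if no such path exists. `edges` is a list of (u, v) pairs."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_edges[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    starts = [n for n in nodes if outdeg[n] - indeg[n] == 1]
    ends = [n for n in nodes if indeg[n] - outdeg[n] == 1]
    if any(abs(outdeg[n] - indeg[n]) > 1 for n in nodes):
        return None
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    start = starts[0] if starts else edges[0][0]
    # Hierholzer: walk until stuck, then backtrack and splice in detours.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # If the edges are not all connected, some of them were never used.
    return path if len(path) == len(edges) + 1 else None

print(eulerian_path([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]))
```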

Page 81: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

De Bruijn Graphs

• Nodes are (k-1)-mers
• Edges are k-mers

• The set of k-mers is called a k-spectrum

• Finding shortest string with given k-spectrum.

Example k-spectrum (k = 3): {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}

[Figure: de Bruijn graph with nodes CA, GC, AG, TC, AT, TT]

Page 82: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

De Bruijn Graphs

• Break each read sequence into overlapping fragments of size k (k-mers).

• Form the De Bruijn graph such that each (k-1)-mer represents a node in the graph.

• An edge exists from node a to node b iff there exists a k-mer whose prefix is a and whose suffix is b.

• Traverse the graph along unambiguous paths to form contigs.
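The construction steps above can be sketched as follows, using the 3-mer spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} from the previous slide. This is a minimal illustration, not an assembler: it assumes error-free k-mers, assumes an Eulerian path exists, and returns just one of the possibly several strings consistent with the spectrum. Function names are hypothetical.

```python
from collections import defaultdict

def kmers(seq: str, k: int):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn(kmer_set):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in sorted(kmer_set):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def spell_eulerian_path(graph):
    """Walk every edge once (Hierholzer) and spell the resulting string."""
    indeg = defaultdict(int)
    for u, targets in graph.items():
        for v in targets:
            indeg[v] += 1
    nodes = set(graph) | set(indeg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in sorted(nodes)
                  if len(graph.get(n, [])) - indeg[n] == 1), min(nodes))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

spectrum = {"AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"}
assembled = spell_eulerian_path(de_bruijn(spectrum))
print(assembled, set(kmers(assembled, 3)) == spectrum)
```

Because this graph admits more than one Eulerian path, different traversal orders can spell different strings with the same k-spectrum, which is the ambiguity the Eulerian superpath formulation addresses.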

Page 83: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

• K = 4

De Bruijn Graphs

Page 84: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Eulerian Path Approach to DNA Fragment Assembly

• Ultimately, converts an NP-complete Hamilton Path Problem into a simplified Eulerian Path Problem through construction of a de Bruijn graph

•The number of ways to reconstruct the graph is equivalent to the number of paths which follow the respective directions and travel through all edges

•The resulting problem is that there are a number of different Eulerian Paths through this graph, and we cannot tell which would resemble the original path

Page 85: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Eulerian Superpath Problem

•Eulerian Superpath Problem – Given an Eulerian Graph and a collection of paths on this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

•The original Eulerian Path Problem is a case of the Eulerian Superpath Problem, in which every path is a single edge.

Solving: Take graph G and the system of paths P, and transform these to a new graph G1 and a new system P1. With the goal in mind that there is a one-to-one correspondence (equivalence) between (G,P) and (G1,P1), we go on to make a series of these transformations.

(G,P) → (G1,P1) → (G2,P2) →…→ (Gk,Pk)

All these transformations should lead to a system Pk in which every path is represented by one edge. Since every transformation from beginning to end is an equivalence, every solution of the EPP in (Gk,Pk) will provide a solution to the ESPP in (G,P).

Page 86: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

An x,y-detachment (case with no multiple edges)

Let x = (v_in, v_mid) and y = (v_mid, v_out) be two consecutive edges in G, and let Px,y be all paths from P that include x,y as a subpath.

P→x is the collection of paths from P that end with x, and Py→ is the collection of paths from P that start with y.

The detachment adds a new edge z = (v_in, v_out) and deletes the edges x and y.

We can substitute z for x,y in all paths from Px,y, for x in all paths from P→x, and for y in all paths from Py→. Repeating such detachments reduces an ESPP to an EPP.

Page 87: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

De Bruijn Graphs

• Elegant way of representing the problem.
• Very fast execution.
• Error correction can be handled in the graph.
• De Bruijn graph size can be huge.
– ~200GB for human genomes.
• Does not use pair information in initial phase, resulting in overly complicated graphs.

Page 88: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Repeats

• Repeats in the sequence
– Assembly programs should detect repeats during the assembly process and not after.
• Incorrect genome reconstruction
– Assemblers should try to resolve correctly as many repeats as possible.

Page 89: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Repeats

• Detecting repeats
– Euler assembly program
• Finds repeats by looking for complex parts of the graph constructed during the assembly process.
• Researchers look into these complex areas to try and resolve repeats.
• Assemblers can use clone mate (paired end) information to find incorrect assemblies. This is based on finding clone-mate pairs too close or too far from one another.

Page 90: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

ASSEMBLY OF READS WITH ERRORS

• Errors in read data greatly complicate the task of fragment assembly.

• Error correction is performed prior to assembly by solving the error correction problem.

Page 91: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Resequencing (mutation discovery/genotyping)

• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms – looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences – both tandem and dispersed repeats

Page 92: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

New Challenge

Alignment programs are needed to map short sequencing reads from next-generation sequencing technologies to a reference genome.

Page 93: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

The reads mapping problem

Given a set of reads R, for each read r ∈ R, find its target regions on the reference genome G, such that for each target region t there are at most k mismatches between r and t.
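The problem statement translates directly into a brute-force mapper: slide the read along the genome and keep every position with at most k mismatches. This naive sketch only makes the definition concrete; it counts substitutions only (no insertions or deletions) and is far too slow for real data, which is why the indexed aligners described next exist. The function name and example strings are illustrative.

```python
def map_read(read: str, genome: str, k: int):
    """Return all start positions where `read` matches `genome`
    with at most k mismatches (Hamming distance)."""
    hits = []
    m = len(read)
    for i in range(len(genome) - m + 1):
        mismatches = 0
        for a, b in zip(read, genome[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break          # too many mismatches at this position
        else:
            hits.append(i)         # loop finished without exceeding k
    return hits

genome = "ACGTACGTTACG"
print(map_read("ACGA", genome, k=1))   # [0, 4]
print(map_read("ACGA", genome, k=0))   # []
```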

Page 94: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.


Page 95: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Aligner algorithms

Aligner algorithms can be divided into two categories:

• Seeded alignment algorithms (BLAST-like)

• Burrows-Wheeler transform based algorithms

Page 96: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Seed alignment algorithm

BLAST is the most popular tool.
Requires a query sequence to search for, and a sequence to search against.

Step 1: Make a k-letter word list of the query sequence.

Step 2: List the possible matching words.

Step 3: Extend the match to find the high similarity pair.

Example sequences: TAGGACCTAACC and GACCACCTTTT

Page 97: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Find seeded matches of 11 base pairs

Extend each match to right and left, until the scores drop too much, to form an alignment

Report all local alignments

Example:

AGCGATGTCACGCGCCCGTATTTCCGTA

TCGGATCTCACGCGCCCGGCTTACCGTG

0001110111111111110011011110 (1 = match, 0 = mismatch)


Blast algorithm
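A minimal sketch of the seed-and-extend idea, applied to the example pair above. The scoring (match +1, mismatch −1), the drop-off threshold of 2, and the function names are assumptions chosen for illustration; real BLAST uses different scoring and also handles gaps.

```python
def find_seeds(a, b, k=11):
    """Exact k-mer matches between a and b as (pos_in_a, pos_in_b) pairs."""
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    return [
        (i, j)
        for j in range(len(b) - k + 1)
        for i in index.get(b[j:j + k], [])
    ]


def extend(a, b, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of the seed a[i:i+k] == b[j:j+k] in both directions.
    Each direction stops once the running score falls more than x_drop
    below the best score seen, and keeps the best-scoring extension."""

    def walk(pi, pj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= pi < len(a) and 0 <= pj < len(b):
            score += match if a[pi] == b[pj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            pi += step
            pj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = k * match + left_score + right_score
    # Half-open interval [start, end) on sequence a, plus the total score.
    return i - left_len, i + k + right_len, total


a = "AGCGATGTCACGCGCCCGTATTTCCGTA"
b = "TCGGATCTCACGCGCCCGGCTTACCGTG"
seeds = find_seeds(a, b, k=11)
alignment = extend(a, b, *seeds[0], k=11)
```

The only run of 11 consecutive matches in the example starts at position 7 (0-based), so a single seed is found and then extended over the neighbouring mismatches.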

Page 98: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Spaced Seed: nonconsecutive matches and optimized match positions.

Represent the BLAST seed by 11111111111

Spaced seed: 111010010100110111

1 means a required match

0 means a “don’t care” position

The length of the seed is the string length, and the weight of the seed is the number of 1s in the string.

This seemingly simple change makes a huge difference: it significantly increases hits to homologous regions while reducing bad hits.


Spaced seed
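The definitions of length, weight, and "hit" can be checked with a short sketch. The function names are assumptions; the similarity string is the match/mismatch pattern of the BLAST example pair (1 = match, 0 = mismatch).

```python
def seed_length(seed):
    """Length of the seed: the length of its 0/1 string."""
    return len(seed)


def seed_weight(seed):
    """Weight of the seed: the number of required-match positions (1s)."""
    return seed.count("1")


def seed_hits(seed, similarity):
    """Offsets at which every '1' of the seed lies over a '1' (match)
    of the similarity string; '0' positions of the seed are ignored."""
    required = [p for p, c in enumerate(seed) if c == "1"]
    return [
        s
        for s in range(len(similarity) - len(seed) + 1)
        if all(similarity[s + p] == "1" for p in required)
    ]


blast_seed = "11111111111"
spaced_seed = "111010010100110111"
similarity = "0001110111111111110011011110"

consecutive_hits = seed_hits(blast_seed, similarity)
spaced_hits = seed_hits(spaced_seed, similarity)
```

Both seeds have weight 11, but the spaced seed spans 18 positions, so its required matches are spread over a longer stretch of the alignment.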

Page 99: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed1, seed2, …, seedi, …, seedn}

∏ detects a similarity if at least one of the component seeds detects the similarity.

Example: simultaneous seeds {1101, 1011} detect similarities 100110100001, 1000010110001, 1101001011001


Multiple simultaneous seeds
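The example can be verified with a small sketch in which a seed set detects a similarity as soon as any one of its seeds fits somewhere. The function names and the non-detected string in the last line are assumptions added for illustration.

```python
def seed_detects(seed, similarity):
    """True if the seed fits at some offset: every '1' of the seed
    is aligned with a '1' of the similarity string."""
    required = [p for p, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[s + p] == "1" for p in required)
        for s in range(len(similarity) - len(seed) + 1)
    )


def seeds_detect(seeds, similarity):
    """A seed set detects a similarity if at least one member does."""
    return any(seed_detects(seed, similarity) for seed in seeds)


seeds = {"1101", "1011"}
examples = ["100110100001", "1000010110001", "1101001011001"]
detected = [seeds_detect(seeds, s) for s in examples]

# A made-up similarity that neither seed detects.
missed = seeds_detect(seeds, "1001001")
```

Note that the second similarity is detected only by 1011, which shows how a second seed can catch cases the first one misses.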

Page 100: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

The prefix trie for string X is a tree where each edge is labeled with a symbol and the string concatenation of the edge symbols on the path from a leaf to the root gives a unique prefix of X.

On the prefix trie, the string concatenation of the edge symbols from a node to the root gives a unique substring of X .

The prefix trie of X is identical to the suffix trie of the reverse of X, and therefore suffix trie theory can also be applied to the prefix trie.
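A small sketch of this equivalence, assuming a nested-dictionary trie and hypothetical function names: the prefix trie is built as the suffix trie of the reversed string, and reading edge symbols from any node back to the root yields a substring of X. The string "googol" is the example used on later slides.

```python
def build_prefix_trie(x):
    """Build the prefix trie of x as the suffix trie of reversed x.
    Each node is a dict mapping an edge symbol to a child node."""
    root = {}
    rev = x[::-1]
    for start in range(len(rev)):
        node = root
        for symbol in rev[start:]:
            node = node.setdefault(symbol, {})
    return root


def node_strings(trie, suffix=""):
    """Yield, for every non-root node, the concatenation of edge symbols
    read from that node up to the root."""
    for symbol, child in trie.items():
        s = symbol + suffix  # the deeper edge comes first when read node-to-root
        yield s
        yield from node_strings(child, s)


x = "googol"
trie = build_prefix_trie(x)
from_trie = set(node_strings(trie))
substrings = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
```

For this input the set of node-to-root strings coincides with the set of non-empty substrings of X, which is the property stated above.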


Page 101: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Let ∑ be an alphabet. Symbol $ is not present in ∑ and is lexicographically smaller than all the symbols in ∑.

A string X = a_0 a_1 … a_{n−1} always ends with symbol $ (i.e. a_{n−1} = $).

The suffix array S of X is a permutation of the integers 0…n−1 such that S(i) is the start position of the i-th smallest suffix.

Page 102: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

To compute S(·), string X is rotated (circulated) to generate n strings, which are then lexicographically sorted.


Page 103: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

After sorting, the positions of the first symbols form the suffix array.

BWT(X) is the last column of the sorted matrix.
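The construction can be sketched directly from this description: generate all rotations, sort them, then read off the start positions and the last column. This is a quadratic-space illustration, not a practical construction algorithm; the function name is an assumption, and the sort relies on '$' ordering before letters in ASCII.

```python
def rotations_sa_bwt(x):
    """Return (suffix array, BWT) of x + '$' via sorted rotations."""
    text = x if x.endswith("$") else x + "$"
    n = len(text)
    # Each row is a rotation of the text paired with its start position.
    rows = sorted((text[i:] + text[:i], i) for i in range(n))
    suffix_array = [i for _, i in rows]
    bwt = "".join(row[-1] for row, _ in rows)  # last column of the matrix
    return suffix_array, bwt


sa, bwt = rotations_sa_bwt("googol")
```

For "googol$" the sorted rotations start at positions 6, 3, 0, 5, 2, 4, 1, and the last column reads lo$oogg.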


Page 104: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.


Page 105: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Most algorithms for constructing a suffix array require at least n log₂ n bits of working space, which amounts to 12GB for the human genome.

Recently, Hon et al. (2007) gave a new algorithm that uses n bits of working space and requires <1GB of memory at peak time for constructing the BWT of the human genome.
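The 12GB figure can be reproduced with a rough calculation, assuming a genome of about 3 × 10⁹ bases and rounding log₂ n up to a whole number of bits per suffix array entry (both are assumptions made here, not values given on the slide).

```python
import math

n = 3_000_000_000                      # assumed human genome length in bases
bits_per_entry = math.ceil(math.log2(n))  # bits needed to store one position
suffix_array_gb = n * bits_per_entry / 8 / 1e9

# Working space of n bits, for comparison.
n_bits_gb = n / 8 / 1e9
```

Under these assumptions each entry needs 32 bits, giving 12 GB for the suffix array, whereas n bits is well under 1 GB.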


Page 106: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

If string W is a substring of X, the position of each occurrence of W in X will occur in an interval in the suffix array.

Based on this observation, we define:

R(W) = min{k : W is the prefix of X_{S(k)}}

R’(W) = max{k : W is the prefix of X_{S(k)}}

(X_i = X[i, n−1] is a suffix of X.)

In particular, if W is an empty string, R(W) = 1 and R’(W) = n−1.

Page 107: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

The interval [R(W), R’(W)] is called the SA interval of W, and the set of positions of all occurrences of W in X is

{S(k) : R(W) ≤ k ≤ R’(W)}

For example, the SA interval of string ‘go’ is [1,2]. The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’ in “googol”.
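The ‘go’ example can be reproduced with a direct, unoptimized reading of the definitions of R(W) and R’(W). The function names are assumptions; row 0 of the suffix array (the suffix "$") is skipped so that the empty string yields [1, n−1] as defined above.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, w):
    """Return (R(W), R'(W)): the smallest and largest k such that w is a
    prefix of the suffix starting at sa[k], or None if w does not occur."""
    ks = [k for k in range(1, len(sa)) if text.startswith(w, sa[k])]
    if not ks:
        return None
    return ks[0], ks[-1]


text = "googol$"
sa = suffix_array(text)
interval = sa_interval(text, sa, "go")
positions = [sa[k] for k in range(interval[0], interval[1] + 1)]
```

A linear scan is used here for clarity; since the matching rows are contiguous, a binary search over the suffix array would find the same interval faster.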

Page 108: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Knowing the intervals in the suffix array, we can get the positions.

Therefore, sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query.

For the exact matching problem, we can find only one such interval.
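One standard way to find that single interval is backward search over the BWT, as used in FM-index based aligners. The slides do not spell this procedure out, so the sketch below is an assumption about how the search could be done, with hypothetical names. It uses 0-based rows and starts from the full range [0, n−1], counting $ among the smaller symbols, so its bookkeeping differs slightly from the R/R’ convention above while producing the same intervals for non-empty queries.

```python
def backward_search(x, w):
    """Return (lo, hi, positions) for the SA interval of w in x + '$',
    or None if w does not occur. Processes w from its last symbol."""
    text = x + "$"
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT symbol of row k is the character preceding suffix sa[k]
    # (text[-1] is '$' when sa[k] == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # c[a]: number of symbols in text (including '$') smaller than a.
    c = {a: sum(ch < a for ch in text) for a in set(text)}

    def occ(a, i):
        """Occurrences of a in bwt[0..i]; zero when i is -1."""
        return bwt[:i + 1].count(a)

    lo, hi = 0, n - 1
    for a in reversed(w):
        if a not in c:
            return None
        lo = c[a] + occ(a, lo - 1)
        hi = c[a] + occ(a, hi) - 1
        if lo > hi:
            return None
    return lo, hi, sorted(sa[k] for k in range(lo, hi + 1))


go = backward_search("googol", "go")
ol = backward_search("googol", "ol")
```

Each query symbol narrows the interval in one step, so the work per read depends on the read length rather than the genome length (here `occ` is a naive count; real indexes precompute it).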


Page 109: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

We can compute SA intervals for all nodes in the trie, so mapping each read is equivalent to searching the trie.


Page 110: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.


Page 111: Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
