Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department

Jan 02, 2016

Page 1: Next-generation sequencing: informatics & software aspects

Next-generation sequencing: informatics & software aspects

Gabor T. Marth
Boston College Biology Department

Page 2: Next-generation sequencing: informatics & software aspects

Next-gen data

Page 3: Next-generation sequencing: informatics & software aspects

Read length

[Figure: read length ranges by platform; axis: read length [bp], 0 to 400]

~200-450 (variable)

25-70 (fixed)

25-50 (fixed)

20-60 (variable)

Page 4: Next-generation sequencing: informatics & software aspects

Paired fragment-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
• instrumental for structural variation discovery

• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
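The "rescue" idea in the bullet above can be illustrated with a minimal sketch. It assumes one end of the pair maps uniquely, the other end has several candidate positions on the same chromosome, and the acceptable fragment length range is the 100 - 600 bp quoted for fragment amplification. The function name and the example coordinates are hypothetical, and read length and strand are ignored for brevity.

```python
def rescue_ambiguous_end(unique_mate_pos, candidate_positions,
                         min_frag=100, max_frag=600):
    """Keep only candidate positions whose implied fragment length
    (distance to the uniquely mapped mate) lies in [min_frag, max_frag]."""
    consistent = []
    for pos in candidate_positions:
        implied_fragment = abs(pos - unique_mate_pos)
        if min_frag <= implied_fragment <= max_frag:
            consistent.append(pos)
    return consistent


# Hypothetical example: three candidate placements for the ambiguous end.
candidates = [1_250, 48_300, 910_775]
placed = rescue_ambiguous_end(unique_mate_pos=1_000,
                              candidate_positions=candidates)
print(placed)
```

If exactly one candidate survives the constraint, the non-unique end can be placed; if none or several survive, the pair stays ambiguous.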

Page 5: Next-generation sequencing: informatics & software aspects

Representational biases

• this affects genome resequencing (deeper starting read coverage is needed)
• will have a major impact on counting applications

“dispersed” coverage distribution

Page 6: Next-generation sequencing: informatics & software aspects

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated into every clonal copy

Page 7: Next-generation sequencing: informatics & software aspects

Read quality

Page 8: Next-generation sequencing: informatics & software aspects

Error rate (Solexa)

Page 9: Next-generation sequencing: informatics & software aspects

Error rate (454)

Page 10: Next-generation sequencing: informatics & software aspects

Per-read errors (Solexa)

Page 11: Next-generation sequencing: informatics & software aspects

Per read errors (454)

Page 12: Next-generation sequencing: informatics & software aspects

Applications

Page 13: Next-generation sequencing: informatics & software aspects

Genome resequencing for variation discovery

SNPs

short INDELs

structural variations

• the most immediate application area

Page 14: Next-generation sequencing: informatics & software aspects

Genome resequencing for mutational profiling

Organismal reference sequence

• likely to change “classical genetics” and mutational analysis

Page 15: Next-generation sequencing: informatics & software aspects

De novo genome sequencing

Lander et al. Nature 2001

• difficult problem with short reads

• promising, especially as reads get longer

Page 16: Next-generation sequencing: informatics & software aspects

Identification of protein-bound DNA

Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods 2007)

DNA methylation (Meissner et al. Nature 2008)

• natural applications for next-gen. sequencers

Page 17: Next-generation sequencing: informatics & software aspects

Transcriptome sequencing: transcript discovery

Mortazavi et al. Nature Methods 2008

Ruby et al. Cell, 2006

• high-throughput, but short reads pose challenges

Page 18: Next-generation sequencing: informatics & software aspects

Transcriptome sequencing: expression profiling

Jones-Rhoades et al. PLoS Genetics, 2007

Cloonan et al. Nature Methods, 2008

• high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

Page 19: Next-generation sequencing: informatics & software aspects

Analysis software (resequencing)

Page 20: Next-generation sequencing: informatics & software aspects

Individual resequencing

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation

[Figure labels: REF (reference sequence), IND (individual)]

Page 21: Next-generation sequencing: informatics & software aspects

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

Page 22: Next-generation sequencing: informatics & software aspects

1. Base calling

base sequence

base quality (Q-value) sequence

diverse chemistry & sequencing error profiles
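As a small illustration of what a base quality (Q-value) encodes, the sketch below uses the standard Phred definition, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The function names are chosen here for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000.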

Page 23: Next-generation sequencing: informatics & software aspects

454 pyrosequencer error profile

• multiple bases in a homopolymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs
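To make the homopolymer problem concrete, here is a deliberately naive sketch that rounds each normalized flow signal to the nearest integer run length. This is not the method used by the native 454 base caller or by PyroBayes; the flow values are invented for illustration.

```python
def call_homopolymer_length(signal):
    """Naive call: round a normalized flow signal to the nearest
    non-negative integer number of incorporated bases."""
    return max(0, int(signal + 0.5))


# Hypothetical flowgram: (nucleotide flowed, normalized signal).
flows = [("T", 1.02), ("A", 0.05), ("C", 2.48), ("G", 3.61)]

called = "".join(base * call_homopolymer_length(signal)
                 for base, signal in flows)
print(called)
```

A signal such as 2.48 sits almost exactly between two and three bases, so a small amount of noise flips the call by one base. The resulting error is an inserted or deleted base rather than a substitution, which is why INDELs dominate.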

Page 24: Next-generation sequencing: informatics & software aspects

454 base quality values

• the native 454 base caller assigns base quality values that are too low

Page 25: Next-generation sequencing: informatics & software aspects

PYROBAYES: determine base number

Page 26: Next-generation sequencing: informatics & software aspects

PYROBAYES: Performance

• better correlation between assigned and measured quality values

• higher fraction of high-quality bases

Page 27: Next-generation sequencing: informatics & software aspects

Base quality value calibration

Raw Illumina reads (1000G data)

Page 28: Next-generation sequencing: informatics & software aspects

Recalibrated base quality values (Illumina)

Recalibrated Illumina reads (1000G data)
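One simple way to think about quality value calibration is to compare each reported quality with the mismatch rate actually observed at positions where the true base is known. The sketch below is an assumption-laden illustration of that idea, not the procedure used to produce the plots on these slides; the function name, the input layout and the add-one smoothing are choices made here.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (reported_q, is_mismatch) pairs for bases
    aligned to positions where the true base is assumed known.
    Returns {reported_q: empirically measured Phred quality}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)

    table = {}
    for q in totals:
        # Add-one smoothing avoids log10(0) for bins with no observed errors.
        rate = (errors[q] + 1) / (totals[q] + 2)
        table[q] = round(-10 * math.log10(rate))
    return table


# Hypothetical data: 1,000 bases reported as Q30, 10 of them mismatches.
obs = [(30, True)] * 10 + [(30, False)] * 990
print(empirical_qualities(obs))
```

In this made-up example the bases reported as Q30 behave like Q20 bases, so a recalibration table would map 30 to 20.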

Page 29: Next-generation sequencing: informatics & software aspects

2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Unique pieces are easier to place than others…

Page 30: Next-generation sequencing: informatics & software aspects

Non-uniqueness of reads confounds mapping

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Page 31: Next-generation sequencing: informatics & software aspects

Strategies to deal with non-unique mapping

• Non-unique read mapping: optionally report only uniquely mapped reads, or report all map locations for each read (mapping quality values for all mapped reads are being implemented)

[Figure: one read with three candidate map locations, assigned probabilities 0.8, 0.19 and 0.01]

• mapping to multiple loci requires the assignment of alignment probabilities (mapping qualities)
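A minimal sketch of how alignment probabilities and a mapping quality could be derived for a multiply-mapping read. It assumes each candidate location comes with a relative likelihood, normalizes these to placement probabilities, and reports a Phred-scaled quality for the best placement. The function names and the cap of 60 are assumptions, not the scheme of any particular aligner.

```python
import math


def placement_probabilities(likelihoods):
    """Normalize relative likelihoods of candidate locations to sum to 1."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]


def mapping_quality(probabilities, cap=60):
    """Phred-scaled probability that the best placement is wrong."""
    p_wrong = 1.0 - max(probabilities)
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# Relative likelihoods matching the 0.8 / 0.19 / 0.01 split on the slide.
probs = placement_probabilities([80, 19, 1])
print(probs, mapping_quality(probs))
```

With a best placement probability of 0.8 the chance of a wrong placement is 0.2, which corresponds to a mapping quality of about 7.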

Page 32: Next-generation sequencing: informatics & software aspects

Longer reads are easier to map

454 FLX (1000G data)

Page 33: Next-generation sequencing: informatics & software aspects

Paired-end reads help unique read placement

PE (paired-end):
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

Korbel et al. Science 2007

MP (mate-pair):
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity

• PE reads are now the standard for genome resequencing

Page 34: Next-generation sequencing: informatics & software aspects

MOSAIK

Page 35: Next-generation sequencing: informatics & software aspects

INDEL alleles/errors – gapped alignments

454

Page 36: Next-generation sequencing: informatics & software aspects

Aligning multiple read types together

ABI/capillary

454 FLX

454 GS20

Illumina

• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics

Page 37: Next-generation sequencing: informatics & software aspects

Aligner speed

Page 38: Next-generation sequencing: informatics & software aspects

3. Polymorphism / mutation detection

sequencing error

polymorphism

Page 39: Next-generation sequencing: informatics & software aspects

Allele calling in “trad” sequences

capillary sequences:
• either clonal
• or diploid traces

Page 40: Next-generation sequencing: informatics & software aspects

Allele calling in next-gen data

SNP

INS

New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

Page 41: Next-generation sequencing: informatics & software aspects

Human genome polymorphism projects

common SNPs

Page 42: Next-generation sequencing: informatics & software aspects

Human genome polymorphism discovery

Page 43: Next-generation sequencing: informatics & software aspects

The 1000 Genomes Project

Page 44: Next-generation sequencing: informatics & software aspects

New challenges for SNP calling

• deep alignments of 100s / 1000s of individuals
• trio sequences

Page 45: Next-generation sequencing: informatics & software aspects

Rare alleles in 100s / 1,000s of samples

Page 46: Next-generation sequencing: informatics & software aspects

Allele discovery is a multi-step sampling process

Population → Samples → Reads → Allele detection

Page 47: Next-generation sequencing: informatics & software aspects

Capturing the allele in the sample

[Figure: Prob(allele captured in sample), 0 to 1, plotted against population AF (1E-04 to 0.5) for sample sizes n=100, n=200, n=400, n=800 and n=1600]
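The capture probability on this slide can be approximated with a short calculation. The sketch assumes that n counts diploid individuals (so 2n chromosomes are sampled) and that chromosomes are drawn independently from the population; the slide does not state either assumption. The function name is hypothetical.

```python
def prob_allele_captured(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one copy of an allele with the given
    population frequency is present among the sampled chromosomes."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


for n in (100, 200, 400, 800, 1600):
    row = [round(prob_allele_captured(af, n), 3)
           for af in (0.0001, 0.001, 0.01)]
    print(n, row)
```

Under these assumptions a 1% allele is present in a sample of 100 individuals with probability of roughly 0.87, whereas much rarer alleles require many more samples to be captured reliably.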

Page 48: Next-generation sequencing: informatics & software aspects

Allele calling in deep sequence data

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

      Q30    Q40    Q50    Q60
1     0.01   0.01   0.1    0.5
2     0.82   1.0    1.0    1.0
3     1.0    1.0    1.0    1.0

Page 49: Next-generation sequencing: informatics & software aspects

Allele calling in the reads

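$$
\Pr(G_1^{k}, G_2^{k}, \ldots, G_n^{k} \mid B) =
\frac{\prod_{i=1}^{n} \left[ \sum_{T_i} \Pr(B_i \mid T_i)\, \Pr(T_i \mid G_i^{k}) \right] \Pr(G_1^{k}, G_2^{k}, \ldots, G_n^{k})}
     {\sum_{l} \prod_{i=1}^{n} \left[ \sum_{T_i} \Pr(B_i \mid T_i)\, \Pr(T_i \mid G_i^{l}) \right] \Pr(G_1^{l}, G_2^{l}, \ldots, G_n^{l})}
$$

Slide annotations: base call ($B_i$), sample size ($n$), individual read coverage, base quality

GigaBayes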

Page 50: Next-generation sequencing: informatics & software aspects

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan

Page 51: Next-generation sequencing: informatics & software aspects

Analysis indicates a balance

Page 52: Next-generation sequencing: informatics & software aspects

SNP calling in trios

[Equation: $\Pr(G_C \mid G_M, G_F)$, the conditional probability of each child genotype (11, 12 or 22) given the genotypes of the mother ($G_M$) and the father ($G_F$), tabulated for every parental genotype combination]

• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
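The two statements above are enough to sketch a transmission model for Pr(G_C | G_M, G_F). The code below assumes a biallelic site with alleles labelled 1 and 2, that each parent transmits one of its two alleles with probability 1/2, and that a transmitted allele mutates to the other allele with a small probability mu. The function names and the default value of mu are assumptions made for illustration, not values taken from the slide.

```python
from itertools import product


def gamete_probs(genotype, mu):
    """Probability that a parent with the given genotype (pair of alleles
    from {1, 2}) transmits allele 1 or allele 2, allowing mutation."""
    probs = {1: 0.0, 2: 0.0}
    for allele in genotype:
        other = 2 if allele == 1 else 1
        probs[allele] += 0.5 * (1.0 - mu)  # transmitted unchanged
        probs[other] += 0.5 * mu           # transmitted after mutation
    return probs


def child_genotype_probs(mother, father, mu=1e-8):
    """Return Pr(child genotype | mother, father) for genotypes 11, 12, 22."""
    from_mother = gamete_probs(mother, mu)
    from_father = gamete_probs(father, mu)
    out = {(1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}
    for a, b in product((1, 2), repeat=2):
        out[tuple(sorted((a, b)))] += from_mother[a] * from_father[b]
    return out


print(child_genotype_probs((1, 2), (1, 2), mu=0.0))
print(child_genotype_probs((1, 1), (1, 1), mu=0.01))
```

Without mutation two heterozygous parents give the familiar 1/4, 1/2, 1/4 split. With mutation, two 11 parents can still produce a 12 or 22 child, but only with probabilities of order mu and mu squared.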

Page 53: Next-generation sequencing: informatics & software aspects

SNP calling in trios

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

mother father

child

P=0.79

P=0.86

Page 54: Next-generation sequencing: informatics & software aspects

Determining genotype directly from sequence

individual 1 (A/C):
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 2 (C/C):
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 3 (A/A):
AACGTTAGCATA
AACGTTAGCATA
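A minimal sketch of calling a diploid genotype directly from the bases observed at one site. It assumes a biallelic site (A or C, as in the reads on this slide), independent reads, and a single fixed per-base error rate that swaps one allele for the other; real callers such as GigaBayes use per-base qualities and priors, which are left out here. All names are hypothetical.

```python
def genotype_likelihoods(bases, ref="A", alt="C", error_rate=0.01):
    """Likelihood of the observed bases under each diploid genotype."""
    likelihoods = {}
    genotypes = ((f"{ref}/{ref}", 0.0), (f"{ref}/{alt}", 0.5),
                 (f"{alt}/{alt}", 1.0))
    for name, alt_fraction in genotypes:
        # Probability that a single read shows the alt base: it comes from
        # an alt chromosome and is read correctly, or from a ref chromosome
        # and is misread.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihood = 1.0
        for base in bases:
            likelihood *= p_alt if base == alt else (1 - p_alt)
        likelihoods[name] = likelihood
    return likelihoods


def call_genotype(bases, **kwargs):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(bases, **kwargs)
    return max(likelihoods, key=likelihoods.get)


# Bases at the variable position in the three read groups on the slide.
print(call_genotype(["A", "A", "C", "C"]))
print(call_genotype(["C", "C", "C", "C"]))
print(call_genotype(["A", "A"]))
```

With only two reads, as in the last group, the homozygous call is favoured but far from certain, which is why read coverage per individual matters for genotyping.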

Page 55: Next-generation sequencing: informatics & software aspects

4. Structural variation discovery

Page 56: Next-generation sequencing: informatics & software aspects

SV events from PE read mapping patterns

[Figure: paired-end read mapping patterns against the DNA reference, per event type]

pattern:

• Deletion (Ldel): LM ~ LF+Ldel & depth: low
• Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
• Inversion (Linv): LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
• Translocation (LT1, LT2): LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2 & depth: normal
• Insertion (Lins): un-paired read clusters & depth: normal
• Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
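The mapping patterns on this slide can be turned into a rough per-pair classifier. The sketch assumes LM denotes the mapped distance between the two ends and LF the expected fragment length, and it flags a pair when LM deviates from LF by more than a chosen number of standard deviations. Depth of coverage and read clustering, which the slide also relies on, are ignored. The function name, thresholds and labels are assumptions made for illustration.

```python
def classify_pair(chrom1, chrom2, mapped_distance, ends_flipped,
                  frag_mean=300, frag_sd=30, n_sd=3):
    """Assign a candidate SV signature to one paired-end read pair."""
    if chrom1 != chrom2:
        # Ends on different chromosomes: cross-paired reads.
        return "chromosomal translocation candidate"
    if ends_flipped:
        # Unexpected relative orientation of the two ends.
        return "inversion candidate"
    if mapped_distance > frag_mean + n_sd * frag_sd:
        # LM ~ LF + Ldel: ends map farther apart than the fragment length.
        return "deletion candidate"
    if mapped_distance < frag_mean - n_sd * frag_sd:
        # LM ~ LF - Ldup: ends map closer together (or in reverse order).
        return "tandem duplication candidate"
    return "concordant"


print(classify_pair("chr1", "chr1", 1500, False))
print(classify_pair("chr1", "chr1", 310, False))
```

A single aberrant pair is weak evidence. In practice candidate events would be supported by clusters of pairs with the same signature, together with the depth-of-coverage pattern listed for each event type.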

Page 57: Next-generation sequencing: informatics & software aspects

Deletion: Aberrant positive mapping distance

Page 58: Next-generation sequencing: informatics & software aspects

Copy number estimation from depth of coverage

Page 59: Next-generation sequencing: informatics & software aspects

Spanner – a hybrid SV/CNV detection tool

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

Page 60: Next-generation sequencing: informatics & software aspects

5. Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Page 61: Next-generation sequencing: informatics & software aspects

Data visualization

Page 62: Next-generation sequencing: informatics & software aspects

New analysis tools are needed

1. Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing)

2. Analysis pipelines and viewers that focus on the essential results, e.g. the few mutations in a mutant, or that compare 1000 genome sequences (but hide most details)

3. Work-bench style tools to support downstream analysis

Page 63: Next-generation sequencing: informatics & software aspects

Data storage and data standards

Page 64: Next-generation sequencing: informatics & software aspects

What level of data to store?

images

traces

base quality values

base-called reads

Page 65: Next-generation sequencing: informatics & software aspects

Data standards

• Sequence Read Format, SRF (Asim Siddiqui, UBC)

• Assembly format working group: http://assembly.bc.edu

• Genotype Likelihood Format (Richard Durbin, Sanger)

Page 66: Next-generation sequencing: informatics & software aspects

Summary

Page 67: Next-generation sequencing: informatics & software aspects

Conclusions: next-gen sequencing software

• Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies

• Informatics tools are already effective for basic applications

• There is a need both for “generic” analysis tools (e.g. flexible read aligners) and for specialized tools tailored to specific applications (e.g. expression profiling)

• Move toward tools that focus on biological analysis

• Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)

Page 68: Next-generation sequencing: informatics & software aspects

Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release

Page 69: Next-generation sequencing: informatics & software aspects

Roche / 454 system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads

Page 70: Next-generation sequencing: informatics & software aspects

Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
• low INDEL error rate

Page 71: Next-generation sequencing: informatics & software aspects

AB / SOLiD system

2-base encoding (color for each pair of adjacent bases):

                 2nd Base
                 A   C   G   T
1st Base    A    0   1   2   3
            C    1   0   3   2
            G    2   3   0   1
            T    3   2   1   0

• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
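The 2-base encoding named on this slide can be expressed compactly: if the bases are numbered A=0, C=1, G=2, T=3, the color of a dinucleotide is the bitwise XOR of its two base codes. The sketch below encodes a base sequence into colors and decodes it again given the first base. The function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Return the color (0-3) for each pair of adjacent bases."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b]
            for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because every base after the first is decoded relative to the previous one, a single wrong color changes all downstream bases when translated naively. This is one reason color-space reads are usually analyzed with dedicated color-space informatics rather than converted to bases first.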

Page 72: Next-generation sequencing: informatics & software aspects

Helicos / Heliscope system

• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

Page 73: Next-generation sequencing: informatics & software aspects

Data characteristics

Page 74: Next-generation sequencing: informatics & software aspects

Data standards

• different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data)

• even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling)
• requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)