Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760

Dec 18, 2015


Cory Peters
Page 1: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Introduction to Short Read Sequencing Analysis

Jim Noonan
GENE 760

Page 2: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Sequence read lengths remain limiting

• For most applications reads are aligned to a reference genome
• Short reads contain inherently limited information
• De novo assembly of short reads is difficult

Chr1: 249 Mb

249 Mb sequencing read

Current platforms:
• A moderate number (~500,000) of long reads (~10 kb)
• A very large number (>200 M) of short reads (100 bp)

Page 3: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Need a computationally efficient method to perform accurate alignments of millions of reads

Aligning short reads to much larger reference
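The read identifier on this slide follows the Illumina CASAVA 1.8 header layout (instrument:run:flowcell:lane:tile:x:y, then read:filtered:control:index). A minimal sketch of splitting it into fields, assuming the header ends at the six-base index TGACCA and the remaining bases are the read itself; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadHeader:
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int           # 1 or 2 for paired-end data
    is_filtered: bool   # "Y" means the read failed the quality filter
    control: int
    index_seq: str      # sample index (barcode)


def parse_header(line: str) -> ReadHeader:
    """Parse '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index'."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index_seq = description.split(":")
    return ReadHeader(instrument, int(run), flowcell, int(lane), int(tile),
                      int(x), int(y), int(read), filtered == "Y",
                      int(control), index_seq)


# Header from the slide, assuming the index is the six bases TGACCA.
header = parse_header("@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA")
print(header.flowcell, header.lane, header.tile, header.index_seq)
```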

Page 4: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Read length requirements vary depending on the feature being studied

Exome:

80-120 bp

Transcriptome:

10,000 bp

Splice junctions (connectivity)

Page 5: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Exome or Genome

Transcriptome

Considerations
• Alignment scoring
• Source of the reads
• Sequencing format (PE or SE)
• Read length
• Error rates

Aligning short reads to much larger reference

Page 6: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Topics

• Mappability

•Error rates and quality scores for short read sequencing

•Common algorithms for short read sequence alignment

• Scoring short read sequence alignments

•Uniform data output formats

•Scoring alignments

Page 7: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Scoring alignments

Correct:

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

Wrong:

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

Adapted from Mark Gerstein

Match (+1): C|C

Mismatch (-1, -2, etc.): C T

Gap penalty:

P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap

A-TAC
| |||
ATTAC

A--AC
|  ||
ATTAC

Many short read alignment algorithms allow a fixed number of mismatches
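The scoring scheme on this slide can be sketched in a few lines. Match (+1) and mismatch (-1) come from the slide; the gap-opening cost a = 2 and gap-extension cost b = 1 are assumed values chosen only for illustration, and the function name is hypothetical.

```python
def score_alignment(read: str, ref: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = 2, gap_extend: int = 1) -> int:
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(read) != len(ref):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_len = 0  # length of the run of gap columns currently being scanned
    for r, g in zip(read, ref):
        if r == "-" or g == "-":
            gap_len += 1
            continue
        if gap_len:  # a gap run just ended: charge P = a + b*N
            score -= gap_open + gap_extend * gap_len
            gap_len = 0
        score += match if r == g else mismatch
    if gap_len:  # gap run that reaches the end of the alignment
        score -= gap_open + gap_extend * gap_len
    return score


reference = "TAGATTACACAGATTAC"
print(score_alignment("TAGATTACACAGATTAC", reference))  # perfect match
print(score_alignment("TAGATTACTCAGA-TAC", reference))  # one mismatch, one 1-bp gap
```

With these assumed penalties the perfect alignment scores 17, while the alignment with one mismatch and one single-base gap scores 15 - 1 - (2 + 1) = 11.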

Page 8: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Scoring alignments

Correct (polymorphism):

TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC

Wrong:

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

Adapted from Mark Gerstein

Match (+1): C|C

Mismatch (-1, -2, etc.): C T

Gap penalty:

P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap

A-TAC
| |||
ATTAC

A--AC
|  ||
ATTAC

Many short read alignment algorithms allow a fixed number of mismatches

Page 9: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Quality scores

A quality score (or Q-score) expresses the probability that a basecall is incorrect. Given a basecall, A:

• The estimated probability that A is not correct is P(~A);

• The quality score for A is Q(A) = -10 log10(P(~A))

A quality score of 10 means a probability of 0.1 that A is the wrong basecall.

P(~A) is platform-specific; Q-scores can be compared across platforms.

Quality scores are logarithmic:

Q-score    Error probability
10         0.1
20         0.01
40         0.0001
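The relationship Q = -10 log10(P) and its inverse can be checked directly. A minimal sketch; the function names are illustrative.

```python
import math


def error_prob_to_q(p_error: float) -> float:
    """Quality score for a given probability that the basecall is wrong."""
    return -10 * math.log10(p_error)


def q_to_error_prob(q: float) -> float:
    """Probability that a basecall with quality score q is wrong."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, q_to_error_prob(q))
```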

Page 10: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Sequencing by synthesis with reversible dye terminators

1 cycle

Scan flow cell

Add base

Reverse termination

Add next base, etc.

Error rates in Illumina sequencing reads

Individual synthesis reactions go out of phase

Page 11: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Error rates in Illumina sequencing reads

• Error rates are mismatch rates relative to reference genome

• Reads may be trimmed to improve alignment quality

• Error rates increase with increasing cycle number

• Contingent on reference genome quality

Page 12: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Illumina quality score encoding in FASTQ format (CASAVA v1.8)

>90% Q30 bases in high quality run
>80% mappable reads
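CASAVA v1.8 writes quality strings with the Phred+33 (Sanger) encoding, where each character's ASCII code minus 33 is the Q-score. A minimal sketch of decoding a quality string and computing the fraction of Q30 bases; the quality string below is made up for illustration and the function names are hypothetical.

```python
def decode_qualities(qual: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to a list of Phred Q-scores."""
    return [ord(ch) - offset for ch in qual]


def fraction_at_least(qual: str, threshold: int = 30, offset: int = 33) -> float:
    """Fraction of bases whose Q-score is at least the threshold."""
    scores = decode_qualities(qual, offset)
    return sum(q >= threshold for q in scores) / len(scores)


example_qual = "IIIIHHGF#5"  # hypothetical quality string, not from a real run
print(decode_qualities(example_qual))
print(fraction_at_least(example_qual, 30))
```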

Page 13: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Sources of error in single-molecule sequencing

Illumina: Consensus signal

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

PacBio: One molecule, one read

TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC

Sequence templates multiple times

Page 14: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Mappability

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

Chr3 repeat    Chr7 repeat

Longer reads:

Paired reads:

Page 15: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Mappability scores at UCSC

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

36mers, 2 mismatches

75mers, 2 mismatches

100mers, 2 mismatches
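The effect of read length on mappability can be illustrated on a toy sequence. This sketch counts only exact-match uniqueness of k-mers on one strand, which is a simplification: the tracks above allow 2 mismatches and cover the whole genome. The sequence and function name are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


toy = "ACGTACGTTTGCA"  # contains the repeat ACGT twice
for k in (4, 6):
    print(k, unique_kmer_fraction(toy, k))
```

On this toy sequence the 4-mers that fall on the repeat are ambiguous, while every 6-mer is unique, mirroring the trend from 36mers to 100mers.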

Page 16: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Poorly mappable regions of the genome

36mers, 2 mismatches

75mers, 2 mismatches

100mers, 2 mismatches

Page 17: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Common algorithms for mapping short reads to a reference genome

Program       Website
ELAND (v2)    N/A (integrated into Illumina pipeline)
Bowtie        http://bowtie-bio.sourceforge.net/
BWA           http://bio-bwa.sourceforge.net/
Maq           http://maq.sourceforge.net/

Considerations
• Alignment scoring method
• Speed
• Quality aware
• Seeding
• Gapped alignment

Page 18: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Seed-based alignment strategy

Reference

Seed

Critical values are seed length and number of mismatches allowed

In ELAND:
Seed length = 32
Number of mismatches = 2

Single seed alignments

Multiseed alignments (ELAND v2, others)

Seed interval contingent on read length

Page 19: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Implementation in ELAND v2

A read must have at least one seed with no more than 2 mismatches and no gaps

Gapped alignment: extend each alignment to full length of read, allowing gaps up to 10 bp
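The seeding step can be sketched with a naive scan. This is not ELAND's implementation (which uses an index rather than scanning); it only illustrates extracting seeds at a fixed interval and accepting reference positions with at most a set number of mismatches and no gaps. The toy reference, seed length, and interval are assumptions.

```python
def extract_seeds(read: str, seed_len: int, interval: int) -> list[tuple[int, str]]:
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    return [(i, read[i:i + seed_len])
            for i in range(0, len(read) - seed_len + 1, interval)]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_hits(seed: str, reference: str, max_mismatches: int = 2) -> list[int]:
    """Reference positions where the seed aligns without gaps within the mismatch limit."""
    n = len(seed)
    return [pos for pos in range(len(reference) - n + 1)
            if hamming(seed, reference[pos:pos + n]) <= max_mismatches]


reference = "GGGTAGATTACACAGATTACCCC"   # toy reference
read = "TAGATTACTCAGATTAC"               # read with one mismatch
for offset, seed in extract_seeds(read, seed_len=8, interval=4):
    print(offset, seed, seed_hits(seed, reference, max_mismatches=2))
```

Allowing mismatches in a short seed produces more candidate positions, which is why the later full-length extension and scoring steps are needed to choose among them.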

Page 20: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Resolving ambiguous read alignments with multiple seeds

Reference

Seed

Page 21: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Resolving ambiguous read alignments with multiple seeds

Page 22: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Utility of gapped alignments

RNA-seq

Insertion and deletion variants in exome and whole genome sequencing

Page 23: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Mapping paired end reads

Read 1 Read 2

Insert size

Insert size within specified range
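The insert-size constraint can be expressed as a simple range check. A minimal sketch, assuming read 1 maps to the forward strand upstream of read 2 and defining insert size as the outer distance between the two reads; the 100-500 bp range and the function names are hypothetical, since the acceptable range depends on the library.

```python
def insert_size(read1_start: int, read2_start: int, read2_len: int) -> int:
    """Outer distance from the first base of read 1 to the last base of read 2."""
    return read2_start + read2_len - read1_start


def is_within_range(read1_start: int, read2_start: int, read2_len: int,
                    min_insert: int = 100, max_insert: int = 500) -> bool:
    """True if the pair's insert size falls inside the specified range."""
    return min_insert <= insert_size(read1_start, read2_start, read2_len) <= max_insert


print(insert_size(1000, 1200, 100))
print(is_within_range(1000, 1200, 100))
print(is_within_range(1000, 5000, 100))
```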

Page 24: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.
Page 25: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

ELAND alignment scoring

Base quality values and mismatch positions in a candidate alignment are used to assign a p value

P values reflect probability that candidate position in genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values

Alignment score for a read is computed from p values of all candidate alignments

If there are two candidates for a read with p values 0.9 and 0.3:

• 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct

• 1 - 0.75 = 0.25, chance highest scoring alignment is wrong

• Alignment score = -10 log10(0.25) ≈ 6
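The arithmetic in this example can be reproduced as follows. The sketch follows only the calculation shown on the slide, not ELAND's actual code; the function name is hypothetical.

```python
import math


def alignment_score(p_values: list[float]) -> float:
    """Phred-scaled chance that the best candidate alignment is wrong."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1 - p_correct
    if p_wrong <= 0:          # a single candidate leaves no competing alignment
        return float("inf")
    return -10 * math.log10(p_wrong)


print(alignment_score([0.9, 0.3]))
```

For the two candidates above the result is about 6.02, which the slide rounds to 6.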

Page 26: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

BaseSpace

https://basespace.illumina.com/

Page 27: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.


Spaced-seed indexing of the reference genome

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Need to break up the genome into manageable segments

• Create index of short sequences

• Match seeds against genome index
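The index-then-lookup idea can be sketched with a hash table of contiguous k-mers. This is a simplification: spaced-seed indexes use templates with ignored positions so that mismatches can be tolerated, whereas this sketch finds exact matches only. The toy reference and function names are assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def lookup(index: dict[str, list[int]], seed: str) -> list[int]:
    """Return reference positions where the seed occurs exactly."""
    return index.get(seed, [])


reference = "GGGTAGATTACACAGATTACCCC"  # toy reference
index = build_kmer_index(reference, k=4)
print(lookup(index, "GATT"))
print(lookup(index, "AAAA"))
```

Building the index costs one pass over the reference; each seed lookup afterwards is a single dictionary access rather than a scan.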

Page 28: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Reference genome indexing using Burrows-Wheeler transform

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Reversible encoding scheme
• Simplifies genome sequence
• Results in “indexed” genome
• Very rapid alignments
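The reversibility of the Burrows-Wheeler transform can be demonstrated on a short string. This is a naive sketch that sorts all rotations, which is far too slow and memory-hungry for a genome; real aligners build the transform from a suffix array and pair it with an FM index. The "$" end marker is assumed not to occur in the input.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations of text + '$'."""
    text += "$"  # end marker, sorts before all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str) -> str:
    """Recover the original text by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded)
print(inverse_bwt(encoded))
```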

Page 29: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Bowtie 2

Pre-built indexed genomes

Bowtie 1 and Bowtie 2 indexes are not compatible

Page 30: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Alignments in Bowtie 2

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Multiseed alignment (ungapped)
Seed length: 16 nt, every 10 nt
# mismatches: 0

Seeds are extended (gaps allowed) to generate alignment

Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG

Match = 2
Mismatch = -6
Gap = -11 (-5 to open, -3 to extend by 1 bp; a 2-bp gap costs -5 + 2 × -3 = -11)
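Applying the slide's scoring values to the Ref/Read pair shown gives a concrete total. A minimal sketch using match = 2, mismatch = -6, gap open = -5, and gap extend = -3 as listed on the slide; it scores an already-aligned pair and does not perform the alignment itself. The function name is hypothetical, and actual Bowtie 2 scores also depend on the alignment mode and base qualities.

```python
REF = "TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG"
READ = "TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG"


def gapped_score(read: str, ref: str, match: int = 2, mismatch: int = 6,
                 gap_open: int = 5, gap_extend: int = 3) -> int:
    """Score an aligned pair; '-' marks a gap, penalties are given as positive costs."""
    if len(read) != len(ref):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for r, g in zip(read, ref):
        if r == "-" or g == "-":
            # First gap column pays open + extend, later columns pay extend only.
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            in_gap = False
            score += match if r == g else -mismatch
    return score


print(gapped_score(READ, REF))
```

The pair has 73 matching columns, one mismatch, and one 2-bp gap, so the total is 73 × 2 - 6 - 11 = 129.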

Page 31: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Mapping in highly repetitive regions

ELAND is conservative
• Non-unique alignments are flagged; only one is reported in export.txt
• Post-alignment CASAVA analyses ignore these

Bowtie will report non-unique alignments
• User-specified options determine how these are reported

http://bowtie-bio.sourceforge.net/manual.shtml
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Page 32: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Sequence Alignment/Map (SAM) format

Standard format for reporting short read alignment data
• BAM is compressed version

Header

Alignment info
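Each alignment line in a SAM file is tab-delimited with 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. A minimal sketch of parsing one line; the example record is made up for illustration, and in practice a library such as pysam or the samtools command line would normally be used instead.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list[str]

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Hypothetical record: read aligned to the reverse strand of chr1 at position 1000.
example = "read1\t16\tchr1\t1000\t42\t8M\t*\t0\t0\tTAGATTAC\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, record.is_reverse)
```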

http://samtools.sourceforge.net/

Page 33: Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760.

Summary

•Read the material posted for this lecture on the class wiki

•Next week: first Regulomics lecture