20140711 4 e_tseng_ercc2.0_workshop

FIND MEANING IN COMPLEXITY For Research Use Only. Not for use in diagnostic procedures.

© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.

Elizabeth Tseng, Staff Scientist / 2014.07.11

Technical Variability in PacBio® Full-length cDNA (Iso-Seq™) Sequencing

SampleNet: Iso-Seq Method with Clontech® cDNA Synthesis Kit

PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts

Experimental Pipeline
•  PolyA mRNA
•  cDNA synthesis with adapters
•  Size partitioning & PCR amplification
•  SMRTbell™ ligation
•  PacBio® RS II Sequencing

Informatics Pipeline
•  PacBio raw sequence reads
•  Remove adapters (5’ primer, 3’ primer), remove artifacts → clean sequence reads
•  Reads clustering → isoform clusters
•  Consensus calling → nonredundant transcript isoforms
•  Quality filtering → final isoforms
•  Map to reference genome

Figure 1. (a) Experimental pipeline: polyA mRNA, cDNA synthesis with adapters, size partitioning & PCR amplification, SMRTbell ligation, RS sequencing. (b) Informatics pipeline: remove adapters and artifacts (clean sequence reads), reads clustering (isoform clusters), consensus calling (nonredundant transcript isoforms), quality filtering (final isoforms), map to reference genome (evidence-based gene models). Legend: coding sequence, polyA tail, SMRT® adapter, Reads of Insert.

DevNet: Iso-Seq wiki page

Iso-Seq Full-length cDNA Library Protocol

•  Total RNA
•  Optional Poly-A Selection → polyA+ RNA
•  Reverse Transcription (SMARTScribe RT) → Full-length 1st Strand cDNA
•  PCR Optimization
•  Large-scale Amplification → Amplified cDNA
•  Size Selection (1-2 kb, 2-3 kb, 3-6 kb)
•  Re-Amplification (1-2 kb, 2-3 kb, 3-6 kb)
•  SMRTbell™ Template Preparation (1-2 kb, 2-3 kb, 3-6 kb)
•  SMRT® Sequencing

An additional Optional Size Selection is shown for the 3-6 kb fraction.

Iso-Seq Informatics Pipeline

•  Per-molecule reads: full-length (FL) reads and non-FL reads
•  Isoform-level clusters
•  Clusters of transcript alignments using FL + nFL reads (Transcript 1, Transcript 2, Transcript 3)
•  Final transcript consensus
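The FL / non-FL split above can be illustrated with a toy classifier. This is only a sketch under stated assumptions, not the Iso-Seq implementation: the primer sequences and the minimum polyA length are hypothetical placeholders, matching is exact, and only one strand is checked, whereas real reads contain sequencing errors and can appear in either orientation. The rule used here is that a read counts as FL only when the 5’ primer, a polyA tail, and the 3’ primer are all observed.

```python
# Toy FL / non-FL classifier. All constants are hypothetical placeholders,
# not real kit primers or pipeline defaults.
FIVE_PRIME = "GGCCTTAAGG"    # hypothetical 5' primer
THREE_PRIME = "CCTTGGAACC"   # hypothetical 3' primer
MIN_POLYA = 8                # hypothetical minimum polyA run length


def classify_read(seq):
    """Return ("FL", insert) if both primers and a polyA tail are seen,
    otherwise ("nFL", seq) with the read left untouched."""
    if seq.startswith(FIVE_PRIME) and seq.endswith(THREE_PRIME):
        # Strip the primers to obtain the insert between them.
        insert = seq[len(FIVE_PRIME):len(seq) - len(THREE_PRIME)]
        # Require a polyA run immediately before the 3' primer.
        if insert.endswith("A" * MIN_POLYA):
            return "FL", insert
    return "nFL", seq


if __name__ == "__main__":
    body = "ATGCCGTACG" + "A" * 12
    print(classify_read(FIVE_PRIME + body + THREE_PRIME)[0])  # full-length read
    print(classify_read(body + THREE_PRIME)[0])               # 5' primer missing
```

A production classifier would use error-tolerant alignment for the primers and would also test the reverse complement; exact string matching is used here only to keep the idea visible.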

Key Features of Current Iso-Seq Bioinformatics

•  Non-redundant, full-length, transcript consensus sequences
–  No assembly
–  De novo
–  Achieves high-quality consensus (≥ 99%)
–  Universal PacBio features: robust to GC%, repeat structure, etc.

•  Applications
–  Alternative splicing
–  Fusion transcripts
–  Alternative polyadenylation
–  (possible w/ proper protocol) Alternative start sites

Disclaimer

•  Everything shown from here on is at the transcript/isoform level, not the gene level

•  Data shown is preliminary, very unbaked

•  Concept Analysis

Count Information Associated with Each Unique Transcript

Clusters of transcript alignments using FL + nFL reads → Final transcript consensus → Count matrix

Transcript   Count   Norm_Count
1            8       0.08
2            5       0.05
3            7       0.07
…            …       …

Count Information from non-FL reads

For non-FL reads:
•  If uniquely associated with a transcript, assume it is the transcript
•  If ambiguously associated, most likely because it’s a partial match

read_count = (# of FL) + (# of unique nFL) + (weighted # of ambiguous nFL)

•  For now, weight of an ambiguous nFL read is just 1 / (number of associated transcripts)

In the current dataset, about 40-60% of nFL reads partially match multiple isoforms (FL reads are always fully and uniquely associated)
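The counting rule above can be sketched in a few lines of Python. The function names and input layout are assumptions made for illustration: each FL read is given as the single transcript it belongs to, and each nFL read as the list of transcripts it matches. The normalization step (dividing by the total count) is also an assumption; the slides show a Norm_Count column but do not state how it is computed.

```python
from collections import defaultdict


def read_counts(fl_assignments, nfl_assignments):
    """Per-transcript read counts.

    fl_assignments:  list of transcript ids, one per FL read.
    nfl_assignments: list of lists; each inner list holds the transcript ids
                     one nFL read is associated with.
    """
    counts = defaultdict(float)
    for transcript in fl_assignments:
        counts[transcript] += 1.0          # FL reads are always unique
    for transcripts in nfl_assignments:
        if not transcripts:
            continue                       # unassociated read, ignored
        # A unique nFL read gets weight 1; an ambiguous one is split
        # evenly: 1 / (number of associated transcripts).
        weight = 1.0 / len(transcripts)
        for transcript in transcripts:
            counts[transcript] += weight
    return dict(counts)


def normalize(counts):
    """Assumed normalization: each count divided by the total count."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


if __name__ == "__main__":
    fl = ["T1", "T1", "T2"]
    nfl = [["T1"], ["T1", "T2"], ["T2", "T3"]]
    counts = read_counts(fl, nfl)
    print(counts)
    print(normalize(counts))
```

In this toy input, T1 receives two FL reads, one unique nFL read, and half of one ambiguous nFL read, so its count is 3.5.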

Read Count Variation in Technical Replicates

Rat Heart
•  Technical replicates (same starting RNA & protocol)
•  3 size libraries (1 – 2 kb, 2 – 3 kb, 3 – 6 kb)
•  Runs from different sizes pooled for bioinformatics pipeline

Figure panels (Rat Heart, technical replicates): boxplot of log2 read counts; scatterplot of log2 read count for each transcript

Read Count Variation in Technical Replicates

Rat Lung, technical replicates

•  All technical replicates were sequenced with a total of ~8 SMRT® Cells (low depth)
•  Most NA transcripts have low counts
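A replicate comparison of this kind can be sketched as follows. This is a minimal illustration, not the analysis behind the slides: the input is assumed to be a dictionary of per-transcript read counts for each replicate, transcripts present in only one replicate are reported as NA, and Pearson correlation of the log2 counts is a statistic chosen here for the example (the slides show scatterplots, not a specific statistic).

```python
import math


def log2_counts(counts):
    """log2 of every positive read count."""
    return {t: math.log2(c) for t, c in counts.items() if c > 0}


def pearson(x, y):
    """Pearson correlation; returns NaN if it is undefined."""
    n = len(x)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def compare_replicates(rep1, rep2):
    """Return (shared transcripts, NA transcripts, correlation of log2 counts)."""
    log1, log2_ = log2_counts(rep1), log2_counts(rep2)
    shared = sorted(set(log1) & set(log2_))
    na = sorted(set(log1) ^ set(log2_))    # seen in only one replicate
    r = pearson([log1[t] for t in shared], [log2_[t] for t in shared])
    return shared, na, r


if __name__ == "__main__":
    rep1 = {"T1": 8, "T2": 4, "T3": 2, "T4": 1}
    rep2 = {"T1": 16, "T2": 8, "T3": 4}
    print(compare_replicates(rep1, rep2))
```

Listing the NA transcripts separately makes it easy to check whether they are mostly low-count transcripts, as observed for the low-depth replicates above.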

Choice of Chemistry Does Not Bias Sequencing

Rat Brain
Same 3-size library (not technical replicate)
•  Sequenced with P4-C2 chemistry
•  Sequenced with P5-C3 chemistry

However, for longer (> 3 kb) transcripts, P5-C3 chemistry will increase the chance of seeing FL reads

Choice of PCR Enzyme May Bias Amplification


Human Brain, 2 – 3 kb library

Human Brain, 3 – 6 kb library

Current Iso-Seq Protocol Amplifies Sample Twice

(Same flowchart as in "Iso-Seq Full-length cDNA Library Protocol". The two amplification steps are Large-scale Amplification, before Size Selection, and Re-Amplification, after Size Selection.)

2nd Amplification Does Not Introduce Strong Bias

FL Read Length Distribution

Figure panels: Std. vs. skipping 2nd amp; Std. vs. skipping 1st amp

Skipping 1st amplification results in size selection of first-strand cDNA that may be hard to optimize

Expected Transcript Variability in Different Rat Tissues

Figure panels: Rat Heart vs Rat Lung; Rat Heart vs Rat Brain

Conclusion

•  Technical variation not a big issue
–  If done with same library protocol
–  Different (PCR) enzymes bias amplification
–  Amplification can be tolerated if kept at reasonable # of cycles

•  Potential for DE
–  Still many unknown factors
–  Everything shown in previous slides merely “proof of concept”
–  With control comes better modeling

Looking Ahead

Wet Lab
•  Detection limit
•  Amplification bias
–  Adding control at known %
–  Factors: GC? Length? Enzyme?

Bioinformatics
•  Account for library pooling
•  Ambiguous mapping
•  Modeling bias
•  DE isoform detection
•  Combining short-read data

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
