Dr. Stefan Czemmel, Quantitative Biology Center (QBiC). Lecture 3: Data sources ("Next-generation" technologies). Data Management for Quantitative Biology

Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Aug 07, 2015

Transcript
Page 1: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Dr. Stefan Czemmel, Quantitative Biology Center (QBiC)

Lecture 3: Data sources ("Next-generation" technologies)

Data Management for Quantitative Biology

Page 2: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Overview

• Coevolution of genomic achievements and sequencing technologies

• Next Generation Sequencing (NGS) technologies as data sources

- Introduction to Illumina and PacBio Sequencing technologies

• Applications

• Summary and Outlook

2

Page 3: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Preface: The central dogma – classical view

Figure labels: miRNAs, ubiquitin, histone modification, DNA methylation

The classical flow of information is not always followed: e.g. RNA is increasingly recognized as more than just an information carrier (mRNA); examples are miRNAs and other small non-coding RNAs, ribozymes and riboswitches.

The enormous complexity of the transcriptome and proteome is not reflected in the genome.

Alberts, Molecular Biology of the Cell. 4th edit.

Page 4: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Coevolution of genomic achievements and sequencing technologies

4

Page 5: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Coevolution of genomic achievements and sequencing technologies

1865 | Gregor Mendel presents his research on pea hybridization, showing that what we now call “genes” determine the inheritance of traits

1952 | Rosalind Franklin creates Photograph 51, showing a distinctive pattern that indicates the helical shape of DNA

1953 | James Watson and Francis Crick discover the double helix structure of DNA

1973 | First sequence of 24 bp published

1977 | Frederick Sanger develops rapid DNA sequencing technique

1982 | GenBank started

1983 | First genetic disease mapped: Huntington’s disease

1983 | Invention of polymerase chain reaction (PCR) technology for amplifying DNA

Partially adapted from: https://unlockinglifescode.org/timeline

Page 6: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Coevolution of genomic achievements and sequencing technologies

1987 | ABI Prism 373 (1st automated sequencing machine)

1990 | The Human Genome Project begins

1995 | “Shotgun” sequencing helped to sequence the first bacterial genome: Haemophilus influenzae

1996 | Capillary sequencer: ABI 310

2000 | Genome sequence of model organism fruit fly reported

2001 | First draft of the human genome released

2002 | Mouse becomes first mammalian research organism with decoded genome

2005 | 1st 454 Life Sciences NGS system: GS 20 System

2006 | 1st Solexa NGS sequencer: Genome Analyzer

2007 | 1st ABI NGS sequencer: SOLiD

2009 | 1st Helicos single molecule sequencer: Helicos Genetic Analyser

2011 | 1st Ion Torrent sequencer: PGM

2011 | 1st Pacific Biosciences single molecule sequencer: PacBio RS

2012 | Oxford Nanopore Technologies demonstrates ultra long single molecule reads

ED Green et al. Nature (2010)

Page 7: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Developments in sequencing allowed many genomes to be sequenced…

Homo sapiens, ~3259 Mb: February 15, 2001

Oryza sativa, ~426 Mb: April 5, 2002

Populus trichocarpa, ~417 Mb: September 15, 2006

Page 8: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Developments in sequencing revolutionized human genome sequencing efforts

Illumina estimates that the number of sequenced human genomes will reach ~1.6 million by 2017 (Francis de Souza, President of Illumina, at MIT Technology Review’s EmTech conference in Cambridge, Massachusetts).

Page 9: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Overview platform providers

[Pie chart of platform providers. Labels: Illumina, Roche, Life Tech, PacBio. Values shown: 71, 10, 16, 3]

Source: Mizuho Securities and GenomeWeb survey; no. of respondents: 103

Page 10: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Some of the sequencing technologies in Tuebingen

HiSeq2500, HiSeq2000, MiSeq, HiSeq3000

http://www.illumina.com/systems/sequencing.html

Page 11: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

The principle at the heart of all these technologies is the same: Sanger sequencing

Frederick Sanger (13 August 1918 – 19 November 2013)

Nobel Prizes in Chemistry 1958 and 1980

Prize motivation 1958: "for his work on the structure of proteins, especially that of insulin".

Prize motivation 1980, shared with Walter Gilbert (Paul Berg received the other half of that year's prize): "for their contributions concerning the determination of base sequences in nucleic acids".

http://www.nobelprize.org/nobel_prizes/chemistry/

Page 12: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Sanger sequencing

http://en.wikipedia.org/wiki/Sanger_sequencing

12

Page 13: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Sanger versus second-generation sequencing

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

13

Page 14: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

14

Next Generation Sequencing (NGS) technologies as data sources

Page 15: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

15

PacBio sequencing workflow

1. Sample/library preparation

2. Annealing of Seq Primer to SMRTbell Templates (hairpins)

3. Bind Polymerase (immobilization) to SMRTbell Templates (hairpins)

4. Sequencing

5. Data analysis

Page 16: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

16

PacBio sequencing workflow (3rd generation seq): single molecule approach

Metzker, Nature Reviews Genetics (2010)

Page 17: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

17

Illumina sequencing workflow (2nd generation seq): PCR-based approach

1. Sample/library preparation

2. Cluster generation

3. Sequencing and Imaging

4. Downstream data analysis for a typical RNA-Seq experiment

Page 18: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

18

1. Sample preparation: nucleic acid extraction

http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf

Page 19: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

19

1. Sample preparation: fragmentation and adapter ligation

http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

Page 20: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

20

1. Sample preparation: fragmentation

Expected library traces (analyzed using Bioanalyzer)

http://www.brown.edu/Research/CGP/download/illumina-public/R%20Sequerra%20-%20trouble%20shooting%20library%20preps.pdf

Page 21: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

21

Problematic library traces (analyzed using Bioanalyzer)

http://www.mbl.edu/jbpc/files/2014/05/Bioanalyzer_for_NGS_slideshow.pdf

1. Sample preparation: fragmentation

Tailing

Uneven shearing

Increased size range

Page 22: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

22

Attach library fragments to surface of flow cell

http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

2. Bridge amplification and cluster generation

Page 23: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

23

Flow cells

http://research.stowers-institute.org/microscopy/external/PowerpointPresentations/ppt/Methods_Technology/KSH_Tech&Methods_012808Final.pdf

Page 24: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

24

3. Sequencing and Imaging: Clonal Single Molecule Array

Flat files (*.bcl …)

Fastq files:

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ

Partially adapted from: http://support.illumina.com/content/dam/illumina-marketing/documents/clinical/trusight-one-cardio/primer-ngs-cardiology.pdf

Page 25: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

25

4. Downstream data analysis for a typical RNA-Seq experiment

4.1 Quality control (FastQC): http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

4.2 Alignment to genome (using Tophat2/STAR/BWA …): http://ccb.jhu.edu/software/tophat/index.shtml, http://bioinformatics.oxfordjournals.org/content/29/1/15, http://bio-bwa.sourceforge.net

4.2.1 Manipulation of SAM/BAM files with Samtools: http://samtools.sourceforge.net

4.3 Read counting (HTSeq): http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html

4.4 Statistical analysis for differential expression in R (DESeq2, edgeR, limma): http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html, http://www.bioconductor.org/packages/release/bioc/html/edgeR.html, http://www.bioconductor.org/packages/release/bioc/html/limma.html

Page 26: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

26

Fastq format

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ
@DJB77P1:476:H15H9ADXX:1:1101:1807:1994 1:N:0:AGTTCC
NGGACCTGGAATTACATCACCAATAGCATAGACACCTGAAACATTTGTAG
+
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHH
@DJB77P1:476:H15H9ADXX:1:1101:2215:1967 1:N:0:AGTTCC
NTTTACTGCATCCCTGTGTTGGGTTGAGATTTTGGGTACTCTGAGATAAA
+
#4=DDFFFHHHHHJJJGHIIJJJGIJGJJJJJJJJJDHIJJIJIJJIIJJ

Single sequence (here 50 bp long)

Line 1: starts with '@', contains the sequence identifier and an optional description
Line 2: raw sequence letters
Line 3: starts with '+', optionally followed by the same information as in Line 1
Line 4: encodes the quality values for the sequence in Line 2 (ASCII)
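The four-line layout described on this slide can be read with a few lines of Python. The following is a minimal sketch, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are made up for illustration, and the code assumes strict four-line records (no wrapped sequence lines) and stops at the first empty line.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # Line 1 without the leading '@'
    sequence: str  # Line 2: raw sequence letters
    quality: str   # Line 4: ASCII-encoded quality values


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming strict four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# First record from the slide, used as in-memory test input.
example = (
    "@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC\n"
    "NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC\n"
    "+\n"
    "#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].header, len(records[0].sequence))
```

For a real file, the same generator could be fed an open text file handle instead of the `io.StringIO` object.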

Page 27: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

27

5.1 Raw data inspection: Fastq format

@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC
NCATCTGCTTCCAATTTGTTGGCCATCTTGGTAGCCGCATGGCATATCTC
+
#4=DFFFFHHHHHJJJJJJJJJJJIJJJJJJHIGIIIIJJJIIIJJJGIJ

Single sequence (here 50 bp long)

Fields of the identifier line @DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC

- DJB77P1: unique instrument name
- 476: run ID
- H15H9ADXX: flow cell ID
- 1: flow cell lane
- 1101: tile number within the flow cell lane
- 1745:1986: ‘x’ and ‘y’ coordinates of the cluster within the tile
- 1: the member of a pair, 1 or 2
- N: Y if the read is filtered, N otherwise
- 0: 0 when none of the control bits are on, otherwise it is an even number
- AGTTCC: index sequence
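The identifier fields listed on this slide can be split apart mechanically. The sketch below assumes exactly the layout of the example line (seven colon-separated fields, a space, then four colon-separated fields); `IlluminaHeader` and `parse_header` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str      # unique instrument name
    run_id: int          # run ID
    flowcell_id: str     # flow cell ID
    lane: int            # flow cell lane
    tile: int            # tile number within the lane
    x: int               # x coordinate of the cluster within the tile
    y: int               # y coordinate of the cluster within the tile
    pair_member: int     # member of a pair, 1 or 2
    is_filtered: bool    # True if the flag is 'Y'
    control_bits: int    # 0 or an even number
    index_sequence: str  # index (barcode) sequence


def parse_header(line: str) -> IlluminaHeader:
    """Split an identifier line into the fields listed on the slide."""
    # The part before the space locates the cluster; the part after it
    # describes the read itself.
    location, read_info = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    pair_member, filtered, control_bits, index_sequence = read_info.split(":")
    return IlluminaHeader(
        instrument=instrument,
        run_id=int(run_id),
        flowcell_id=flowcell_id,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        pair_member=int(pair_member),
        is_filtered=(filtered == "Y"),
        control_bits=int(control_bits),
        index_sequence=index_sequence,
    )


header = parse_header("@DJB77P1:476:H15H9ADXX:1:1101:1745:1986 1:N:0:AGTTCC")
print(header.flowcell_id, header.lane, header.tile, header.index_sequence)
```

Identifier lines written by other instruments or software versions may use a different layout, in which case the unpacking above raises a `ValueError` rather than silently mislabelling fields.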

Page 28: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

28

5.1 Raw data inspection: FastQC reports

[Two FastQC plots titled "Quality scores across all bases"; x-axis: Position in read (bp), y-axis: Phred score (q). Left: good Illumina 65 bp long raw data. Right: bad Illumina 40 bp long raw data.]

Left side: in-house data
Right side: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Page 29: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Calculation Phred Quality Score

$q = -10 \log_{10}(p)$

• p = error probability for the base
• if p = 0.01 (1% chance of error), then q = 20
• if p = 0.001 (0.1% chance of error), then q = 30
• Phred quality values are rounded to the nearest integer
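The relation between error probability and Phred score can be checked numerically. The sketch below implements q = -10 log10(p) and its inverse, plus a decoder for the ASCII quality line of a Fastq record. The function names are illustrative, and the ASCII offset of 33 is an assumption (the slides only state that qualities are ASCII-encoded); older Illumina files used a different offset.

```python
import math
from typing import List


def phred_from_error_probability(p: float) -> int:
    """q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def error_probability_from_phred(q: int) -> float:
    """Inverse relation: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn an ASCII quality string into Phred scores.

    The offset of 33 is an assumption, not taken from the slides.
    """
    return [ord(char) - offset for char in quality]


print(phred_from_error_probability(0.01))   # 1% chance of error
print(phred_from_error_probability(0.001))  # 0.1% chance of error
print(decode_quality("#4=DJ"))              # first characters of a quality line
```

Under the assumed offset, the leading '#' in the example reads decodes to a score of 2, which fits the uncalled base 'N' at the first position of each sequence.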

Page 31: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

31

5.2. Alignment/mapping: Visualization with e.g. IGV

Sox17 gene, mm10 genome

SNPs

https://www.broadinstitute.org/software/igv/download

Page 32: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

32

5.3. Read counting (HTSeq)

Anders et al., Bioinformatics (2014)

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Page 33: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

33

5.4 Differential expression (DE) analysis in R

Czemmel et al., PLoS One (2014): using the DESeq package, Anders and Huber, Genome Biology (2010)

http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf: using the edgeR package, Robinson et al., Bioinformatics (2010)

Broderick et al., BMC Plant Biol. (2014): using the DESeq2 package, Love et al., Genome Biology (2014)

The choice of statistical test depends on the experimental design at hand, see e.g.:

Luo et al., Genome Biology 2014

Page 34: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

34

Counting (RNA-Seq) and coverage (WGS)

RNA-Seq: 13 read counts for Gene A

WGS: 5x coverage at position X in Gene A; 2x coverage at position Y in Gene A

[Figure: reads aligned to Gene A, with positions X and Y marked]
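The difference between a read count and a coverage value can be made concrete with toy intervals. The sketch below is not how HTSeq works internally: it ignores strand, spliced reads and reads that overlap several features, and all coordinates are invented for illustration. Intervals are half-open, i.e. (start, end) covers positions start to end - 1.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval: start <= position < end


def count_reads(gene: Interval, reads: List[Interval]) -> int:
    """Number of reads overlapping the gene by at least one base (RNA-Seq view)."""
    gene_start, gene_end = gene
    return sum(1 for start, end in reads if start < gene_end and end > gene_start)


def coverage_at(position: int, reads: List[Interval]) -> int:
    """Number of reads covering a single position (WGS view)."""
    return sum(1 for start, end in reads if start <= position < end)


# Hypothetical gene and 50 bp reads, made up for this example.
gene_a = (100, 200)
reads = [(90, 140), (120, 170), (130, 180), (150, 200), (195, 245), (300, 350)]

print(count_reads(gene_a, reads))  # reads counted for the gene
print(coverage_at(135, reads))     # coverage at one position
print(coverage_at(198, reads))     # coverage at another position
```

The same set of reads therefore yields one count per gene but a different coverage value at each position, which is why RNA-Seq experiments are planned in reads per sample and WGS experiments in fold coverage.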

Page 35: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Experimental design based on coverage/counting

WGS project: the aim is to sequence the whole human genome of three patients at 50x coverage on a HiSeq2500 High Output with 100 bp paired-end (PE) reads. How many lanes/flow cells are needed?

http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf

Multiplex 3 samples per lane:

- human genome ~3.9 Gb = 3900 Mb
- HiSeq2500 High Output: >= ~180e+06 PE reads per lane
- bp per lane = 180e+06 * 200 = 3.6e+10 bp
- coverage per lane per multiplexed genome = 3.6e+10 / (3900e+06 * 3) ≈ 3x
- 50x / 3x: >= 16 lanes = 2 flow cells (with 8 lanes each)

RNA-Seq project: the aim is to sequence the transcriptome of three Arabidopsis plants treated with reagent A and three others treated with control reagent B on a HiSeq2500 Rapid Run v2 with 50 bp single-end (SE) reads. To reach a good scientific standard, 20M reads per sample are needed. How many lanes/flow cells are needed?

Multiplex 6 samples per lane:

- Ath genome ~126 Mb
- HiSeq2500 Rapid Run v2: >= ~150e+06 SE reads per lane
- reads per sample on each lane = 150e+06 / 6 = 25M reads per sample
- 25M >= 20M, so all six samples fit on a single lane
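The two planning calculations on this slide can be reproduced in a few lines. The sketch below uses the slide's own figures (3900 Mb genome, 180e+06 PE reads per lane, 150e+06 SE reads per lane); the function name `coverage_per_lane` is made up for illustration. Without intermediate rounding, the WGS case gives about 3.08x per lane and 16.25 lanes, so the slide's figure of 16 lanes relies on the stated per-lane yield being a lower bound; rounding up strictly gives 17.

```python
import math


def coverage_per_lane(reads_per_lane: float, bases_per_read: int,
                      genome_size_bp: float, samples_per_lane: int) -> float:
    """Fold coverage each multiplexed sample receives from one lane."""
    return reads_per_lane * bases_per_read / (genome_size_bp * samples_per_lane)


# WGS: 3 genomes per lane, 100 bp paired-end reads (200 bp per read pair).
wgs_coverage = coverage_per_lane(180e6, 200, 3900e6, 3)
wgs_lanes = math.ceil(50 / wgs_coverage)  # strict rounding up, no yield margin
print(round(wgs_coverage, 2), wgs_lanes)

# RNA-Seq: 6 samples, 20M reads needed per sample, 150e6 reads per lane.
rna_reads_per_sample = 150e6 / 6
rna_lanes = math.ceil(6 * 20e6 / 150e6)
print(rna_reads_per_sample, rna_lanes)
```

Changing the genome size or the per-lane yield in the call shows how sensitive the lane count is to these inputs, which is worth checking against the current instrument specifications before ordering a run.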

Page 36: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Applications of NGS technologies

Platforms shown: HiSeq2500, MiSeq

Genomics:

- Whole genome (re-)sequencing (WGS)
- De novo sequencing
- Targeted (re-)sequencing, e.g. whole exome sequencing (WES)

Transcriptomics:

- (Total) RNA-Seq
- mRNA-Seq
- Targeted RNA-Seq
- Small non-coding RNA sequencing (miRNAs …)
- Ribosome profiling

Epigenomics:

- Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
- Bisulfite Sequencing (DNA methylation)

For many more see: http://www.illumina.com/content/dam/illumina-marketing/documents/products/research_reviews/sequencing-methods-review.pdf

Page 37: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Transcriptomics: Most widely used technology before NGS: microarrays

Top left: http://www.mun.ca/biology/scarr/cDNA_microarray_Assay_of_Gene_Expression.html

Page 38: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Genomics: De Novo Sequencing

www.illumina.com

38

Page 39: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

www.illumina.com

39

Genomics: Targeted Seq e.g. Whole exome Seq (WES)

Page 40: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Epigenomics

http://www.illumina.com/applications/epigenetics.html

40

Page 41: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

41

Applications

Page 42: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

42

Applications of NGS technologies

Page 43: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Rabbani et al., Journal of Human Genetics (2014)

43

Applications of WES in cancer research

Page 44: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Applications of de novo sequencing: De novo assembly of the Haitian cholera outbreak strain

Bashir et al., Nature (2012)

44

Page 45: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Yant et al., The Plant Cell (2010)

45

Applications of ChIP-Seq: Identification of APETALA2 TF binding sites in the Arabidopsis genome

Page 46: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Li et al., The Plant Cell (2014)

46

Combination of WGS and sequence-capture bisulfite sequencing: Identification of genetic perturbations of the maize methylome

Page 47: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Application of small non coding RNA-Seq: Identification of novel miRNA biomarkers of muscle disease

Guess et al., PLoS One (2015)

47

Page 48: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Summary: Highlighted advantages/disadvantages of NGS technologies

Method: Sanger
Advantages:
- Lowest error rate
- Long read length (~750 bp)
- Low costs for small study
Disadvantages:
- High cost per base/for large studies
- Long time to generate data
- Need for cloning
- Amount of data per run

Method: Illumina
Advantages:
- Low error rate
- Lowest cost per base
- Can support de novo seq approaches performed with PacBio via high output yield
Disadvantages:
- Shorter read length than e.g. PacBio
- High startup costs
- De novo assembly difficult

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Page 49: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Summary: Highlighted advantages/disadvantages of NGS technologies continued

Method: Ion Torrent
Advantages:
- Low startup costs
- Medium/low cost per base
- Low error rate
- Fast runs
Disadvantages:
- Costs higher than e.g. Illumina
- Read length between Illumina and PacBio
- Higher error rate than Illumina

Method: PacBio
Advantages:
- Single molecule as template
- Long reads
- Often used in conjunction with Illumina for de novo seq approaches
Disadvantages:
- Still high error rate
- Low total no. of reads
- Medium/high costs per base
- High startup costs

Method: Oxford Nanopore
Advantages:
- MinION is a USB device
- Extremely low-cost
- Extremely long reads feasible
Disadvantages:
- Unknown error rate

More details: http://www.molecularecologist.com/next-gen-fieldguide-2014/

Page 50: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

50

Outlook

Page 51: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Is there a translation of NGS technologies into clinical diagnostics soon?

NGS technologies are promising in molecular diagnostics, in which high sensitivity and specificity are required, as they:

+ are able to provide single-nucleotide resolution
+ constantly improve, with more simplified and automated sample preparation

However:

- the per-base-position error rate (0.5–2%) is still too high for most diagnostic tools
- the combination of various errors and variability arising from DNA fragmentation, sequencing library preparation, sequencing-by-synthesis and short-read alignment/assembly could incur a significant false-positive rate

Su et al., Expert Rev Mol Diagn. 2011;11(3):333-343.
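To get a feeling for what a per-base error rate of 0.5–2% means for a single read, the following sketch computes the expected number of erroneous bases and the probability that a read contains at least one error. It assumes errors are independent and uniform along the read, which real data do not strictly follow, and uses a read length of 100 bp purely as an illustrative value; the function names are hypothetical.

```python
def expected_errors(per_base_error: float, read_length: int) -> float:
    """Expected number of erroneous bases in one read."""
    return per_base_error * read_length


def prob_at_least_one_error(per_base_error: float, read_length: int) -> float:
    """Probability that a read contains one or more errors,
    assuming independent errors at every position."""
    return 1 - (1 - per_base_error) ** read_length


for rate in (0.005, 0.02):  # the 0.5% and 2% bounds quoted on the slide
    print(rate,
          expected_errors(rate, 100),
          round(prob_at_least_one_error(rate, 100), 3))
```

Under these simplifying assumptions, a substantial fraction of reads carries at least one wrong base, which is why variant calls are normally based on several overlapping reads rather than on one.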

Page 52: Data Management for Quantitative Biology - Data sources (Next generation technologies), Apr. 30, 2015, Dr. Stefan Czemmel

Contact:

Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany

[email protected]

Thanks for listening – See you next week