Next-generation sequencing: from basics to future diagnostics PART I: NGS technologies and standard workflow Sangwoo Kim, Ph.D. Assistant Professor, Severance.

Next-generation sequencing: from basics to future diagnostics

PART I: NGS technologies and standard workflow

Sangwoo Kim, Ph.D., Assistant Professor,

Severance Biomedical Research Institute, Yonsei University College of Medicine

Overview

• PART I: NGS technologies and standard workflow
– Next generation sequencing
  • History and technology
– Data and its meaning; process workflow
– Discussion

• PART II: NGS analysis to find variants
– NGS analysis to find variants
  • Single nucleotide variants (SNVs)
  • Copy number variations (CNVs)
  • Structural variations (SVs)

• PART III: NGS application to diagnostics
– NGS in genomic medicine
– Potential application to forensic science

BACKGROUND

Conventional variant calling
Variant calling in minor subgroups

6.7% of Japanese patients with NSCLC harbor a fusion of EML4 with the intracellular kinase domain of ALK

PF-2341066 (Crizotinib) | cMet/ALK inhibitor

57% response rate, 27% stable disease

“The FDA approved the Pfizer drug in 2011 based on 250 patients, four years after the ALK-mutation link was discovered. That is lightning speed in an industry accustomed to spending a decade with thousands of test subjects to get drug approval.”

McCarthy et al, 2013 Sci Transl Med.

Genomic medicine is a reality

The first breakthrough

The Human Genome Project (1990~2003)

• Began in 1990. The consortium comprised the U.S., U.K., France, Australia, Japan, etc.
• “Rough draft” in 2000
• “Complete genome” published in 2003
• 13 years, $3 billion

The second breakthrough

Metzker et al, Nat Rev Genet, 2010

Massively Parallel Sequencing (a.k.a. Next-generation sequencing)

Illumina HiSeq2500

5500 SOLiD system

Ion Torrent PGM

via spatially separated, clonally amplified DNA templates or single DNA molecules

• Launched in 2008
• Sequencing of 1092 individual genomes was announced in 2012
• Great repository for population genomics

• Inaugural publication in 2009
• Aims to assemble a genomic zoo (10,000 vertebrate species)

• Project announced in 2013, aiming for completion within 5 years
• To identify cancer genes (regarding heterogeneity) and the genetics of rare diseases

Overwhelmed by data

Alex Sanchez, Introduction to NGS data analysis, 2012

“The challenges turns from data generation into data analysis!”


Overwhelmed by data


Elizabeth Pennisi, Science 2011

…“Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that’s assuming there’s someone who can do it; bioinformaticists are in short supply everywhere.”...

From data-poor to data rich

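“In the past, ‘classical’ bioinformatics consisted mainly of algorithms for sequence homology analysis, alignment, and reconstruction. But highly parallelized, large-scale biological data has begun to demand integration and interpretation that go beyond simple analysis.”

“Today, data is generated everywhere. We now live in an era in which data is ‘simply bound to be generated’.”

- Prof. Ju Han Kim, SNU Conference on Biomedical Informatics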


Prof. Atul Butte, Stanford Univ.

Hypothesis-driven data → Data-driven hypothesis

NEXT GENERATION SEQUENCING


Traditional Sequencing

1. Genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli

2. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated

3. Each cycle sequencing reaction takes place within a microliter-scale volume

Sanger Sequencing

Next Generation Sequencing

• No cloning
– DNA to be sequenced is used to construct a library of fragments that have synthetic DNAs (adapters) added covalently to each fragment end by use of DNA ligase

• Amplification can be done in parallel
– Library fragments are amplified in situ on a solid surface

• Sequencing can be done in parallel (in 3 iterative steps)
– a nucleotide addition step
– a detection step
– a wash step

Illumina Sequencing


https://www.youtube.com/watch?v=HMyCqWhwB8E



Ion Torrent Sequencing

1. DNA capture on beads
2. Single bead in a well
3. Attach one nucleotide (A/T/G/C) at a time
4. Detect pH change
   1. Measure the level of change for homopolymer detection
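
The flow-and-detect cycle above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the cyclic flow order "TACG", the function names `flowgram` and `call_bases`, and the example template are hypothetical, and the signal is idealized as an exact count of incorporated bases (a real instrument measures a pH change whose size only approximates the homopolymer length).

```python
def flowgram(template, flow_order="TACG", max_flows=100):
    """Return a list of (nucleotide, incorporations) pairs, one per flow.

    template   -- the strand being synthesized, read 5'->3' (toy input)
    flow_order -- cyclic order in which nucleotides are flowed (an assumption)
    """
    signals = []
    pos = 0
    for i in range(max_flows):
        if pos >= len(template):
            break  # the whole template has been synthesized
        nuc = flow_order[i % len(flow_order)]
        count = 0
        # A homopolymer run incorporates several bases within a single flow,
        # which is why the signal level, not just its presence, is measured.
        while pos < len(template) and template[pos] == nuc:
            count += 1
            pos += 1
        signals.append((nuc, count))
    return signals


def call_bases(signals):
    """Rebuild the sequence from idealized per-flow incorporation counts."""
    return "".join(nuc * count for nuc, count in signals)


signals = flowgram("TTAGGGC")
for nuc, count in signals:
    print(nuc, count)
print(call_bases(signals))
```

Flows with a count of zero correspond to a nucleotide that did not match the next template base, so no pH change would be expected in that flow.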




PacBio SMRT sequencing

zero-mode waveguide (ZMW)
http://www.pacificbiosciences.com/products/smrt-technology/

Nanopore sequencing

https://www.youtube.com/watch?v=3UHw22hBpAk


Comparison

NGS DATA AND PROCESSING OVERVIEW


FASTA format

A format for DNA (or protein) sequence
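
As a minimal sketch of how such a file is read: a FASTA record is a header line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` and the two toy records below are hypothetical and only illustrate the layout.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the first word after '>'; the rest is a description.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # A sequence may be wrapped over several lines; collect them all.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 toy DNA record
ACGTACGT
TTGA
>seq2 toy protein record
MKV
"""
records = parse_fasta(example)
print(records)
```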

FASTQ format (NGS raw data)

A format for NGS reads (FASTA + quality). One read is stored as one record that holds the sequence and its per-base quality.
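
To make the record layout concrete, here is a minimal sketch of a FASTQ reader. It assumes the common unwrapped layout of four lines per read ("@" header, sequence, "+" separator, quality string of the same length as the sequence); the function name `parse_fastq` and the toy read are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


toy = [
    "@read1 toy example",
    "GATTACA",
    "+",
    "IIII?5+",
]
reads = list(parse_fastq(toy))
print(reads)
```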

Mapping back to genome

TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

Where is this sequence in the human genome?
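
Conceptually, mapping asks where a read occurs in the reference. The sketch below answers that by exact string search on both strands of a toy reference; the names `reverse_complement` and `find_read` and the toy sequences are hypothetical. Production aligners instead build an index of the genome and tolerate mismatches and gaps, which this sketch does not attempt.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_read(reference, read):
    """Return sorted (position, strand) pairs for exact matches of the read.

    Positions are 0-based offsets into the reference. The '-' strand is
    searched by looking for the reverse complement of the read.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)
    return sorted(hits)


reference = "TTGACCTAGGCACATTGTCATTAGGAAC"  # toy reference, not real genome data
hits_fwd = find_read(reference, "GGCACATTG")
hits_rev = find_read(reference, reverse_complement("CATTAGG"))
print(hits_fwd, hits_rev)
```

A linear scan like this is far too slow for a 3-billion-base genome and millions of reads, which is the reason dedicated short read aligners exist.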

Genome Informatics I (2015 Spring)

Quality

• Each basecall (a call for a nucleotide: ‘A’, ‘T’, ‘C’, ‘G’) has its own quality
– quality is the machine's confidence in that call

Phred scale

@SRR1798798.1 D4LHBFN1:204:D1B2UACXX:6:1101:1156:1996 length=101
NCTCTCACCGAGCTCCACGAACGATAAGGGAATCAGTCTTAAAAGAGCCGCGAGTTACAGGCACACCTGAGAGAAAGAGATGTTTGTATTCACCTTAGAAC
+SRR1798798.1 D4LHBFN1:204:D1B2UACXX:6:1101:1156:1996 length=101
#1:BDDDDF?FF@B>:ACFIBCGB3BF@C<?F9?DFBFCFEBFEFIFEIFFFDC>@ABBBB?BBBBBBBB?@:?AA@B@?(:4:>?<AB@:B@@B>>ABBB

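Q = -10 log10(e), where e is the probability of the base call being wrong

Probability of error: 10%, 1%, 0.1%, 0.01%...
Quality score (Q): 10, 20, 30, 40...
Q + 33 gives the ASCII code of the quality character
Quality character: +, 5, ?, I...

ASCII code table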
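
The encoding above can be checked with a few lines of Python. This is a sketch that assumes the +33 offset shown on the slide; the function names are hypothetical.

```python
import math

PHRED_OFFSET = 33  # offset shown on the slide ("+33")


def error_prob_to_quality(e):
    """Q = -10 * log10(e), where e is the probability the base call is wrong."""
    return -10 * math.log10(e)


def quality_to_error_prob(q):
    """Invert the Phred formula: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def quality_to_char(q):
    """Encode an integer quality score as a single ASCII character."""
    return chr(q + PHRED_OFFSET)


def char_to_quality(c):
    """Decode a single quality character back to its integer score."""
    return ord(c) - PHRED_OFFSET


encoded = "".join(quality_to_char(q) for q in (10, 20, 30, 40))
decoded = [char_to_quality(c) for c in "#1:B"]  # first four quality characters of the example read
print(encoded, decoded)
```

Applied to the example read, the leading "#" decodes to Q = 2, a very low confidence that matches the uncalled first base "N".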

Kim S and Paik S, in preparation

Figure: standard NGS analysis workflow

A. Data Generation
• control and disease samples
• sequencing
• raw reads (FASTQ files)
• quality control
• short read alignment (BAM files)

B. Variant Finding
• germ-line mutation
• somatic mutation
• copy number variation (CNV)
• structural variation (SV)
• xenogeneic sequence

C. Variant Analysis
• recurrence analysis
• variant impact prediction
• mutation filtration/selection
• tumor heterogeneity inference

Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows inspection of all genomic regions and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome), at lower cost and with lower data throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only a limited set of regions is to be tested (as in a diagnostic kit), a set of targeted genes is amplified and submitted for sequencing (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.

D. Validation and functional assessment

variant confirmation

pathway analysis

functional study

DISCUSSION

