Next-generation sequencing: from basics to future diagnostics PART I: NGS technologies and standard workflow Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
Jan 17, 2016
Overview
• PART I: NGS technologies and standard workflow
  – Next generation sequencing
    • History and technology
  – Data and its meaning; process workflow
  – Discussion
• PART II: NGS analysis to find variants
  – NGS analysis to find variants
    • Single nucleotide variants (SNVs)
    • Copy number variations (CNVs)
    • Structural variations (SVs)
• PART III: NGS application to diagnostics
  – NGS in genomic medicine
  – Potential application to forensic science
BACKGROUND
Conventional variant calling
Variant calling in minor subgroups
6.7% of Japanese patients with NSCLC harbor a fusion of EML4 with the intracellular kinase domain of ALK
PF-2341066 (Crizotinib) | cMet/ALK inhibitor
57% response rate, 27% stable disease
“The FDA approved the Pfizer drug in 2011 based on 250 patients, four years after the ALK-mutation link was discovered. That is lightning speed in an industry accustomed to spending a decade with thousands of test subjects to get drug approval.”
McCarthy et al, 2013 Sci Transl Med.
Genomic medicine is a reality
The first breakthrough
• Began in 1990
• Consortium comprising the U.S., U.K., France, Australia, Japan, etc.
• “Rough draft” in 2000
• “Complete genome” published in 2003
• 13 years, $3 billion
The Human Genome Project (1990~2003)
The second breakthrough
Metzker et al, Nat Rev Genet, 2010
Massively Parallel Sequencing (a.k.a. Next-generation sequencing)
Illumina HiSeq2500
5500 SOLiD system
Ion Torrent PGM
via spatially separated, clonally amplified DNA templates or single DNA molecules
• Launched in 2008
• Sequencing of 1092 individual genomes was announced in 2012
• Great repository for population genomics

• Inaugural publication in 2009
• Aims to assemble a genomic zoo (10,000 vertebrate species)

• Project announced in 2013, aiming for completion in 5 years
• To identify cancer genes (regarding heterogeneity) and the genetics of rare diseases
Overwhelmed by data
Alex Sanchez, Introduction to NGS data analysis, 2012
“The challenge turns from data generation into data analysis!”
Elizabeth Pennisi, Science 2011
…“Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that’s assuming there’s someone who can do it; bioinformaticists are in short supply everywhere.”...
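From data-poor to data-rich environment
“In the past, ‘classical’ bioinformatics consisted mainly of algorithms for sequence homology analysis, alignment, and reconstruction. But highly parallelized, large-scale biological data has begun to demand integration and interpretation that go beyond simple analysis.”
“Today, data is generated everywhere. We now live in an era in which data is ‘simply bound to be generated’.”
- Prof. Ju Han Kim, SNU Conference on Biomedical Informatics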
Prof. Atul Butte, Stanford Univ.
Hypothesis-driven data → Data-driven hypothesis
NEXT GENERATION SEQUENCING
Traditional Sequencing
1. Genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli
2. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated
3. Each cycle sequencing reaction takes place within a microliter-scale volume
Sanger Sequencing
Next Generation Sequencing
• No cloning
  – DNA to be sequenced is used to construct a library of fragments that have synthetic DNAs (adapters) added covalently to each fragment end by use of DNA ligase
• Amplification can be done in parallel
  – Library fragments are amplified in situ on a solid surface
• Sequencing can be done in parallel (in 3 iterative steps)
  – a nucleotide addition step
  – a detection step
  – a wash step
Illumina Sequencing
https://www.youtube.com/watch?v=HMyCqWhwB8E
Ion Torrent Sequencing
1. DNA capture on beads
2. Single bead in a well
3. Attach one nucleotide (A/T/G/C) at a time
4. Detect pH change
   – Measure the level of change for homopolymer detection
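The pH-based detection steps above can be illustrated with a toy base caller. This is only a sketch under simplifying assumptions: the cyclic flow order "TACG", the normalized signal values, and the function name `call_bases` are all hypothetical, and real instruments apply signal corrections that are not modeled here. The idea it shows is that one nucleotide is offered per flow and the size of the signal indicates how many identical bases were incorporated.

```python
def call_bases(flow_order, signals):
    """Turn per-flow signal levels into a base sequence (toy model).

    Each flow offers one nucleotide. A signal near k is read as k
    identical bases incorporated in a row; a signal near 0 means the
    offered nucleotide did not match the template.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round the non-negative signal to the nearest homopolymer length.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical normalized signals for eight flows (T, A, C, G, T, A, C, G).
signals = [1.02, 0.05, 2.10, 0.00, 0.00, 2.96, 0.10, 1.10]
read = call_bases("TACG", signals)
print(read)
```

Because the homopolymer length is inferred from signal magnitude, a noisy value such as 2.5 is ambiguous between two and three bases, which is why long homopolymers are a typical error source for this kind of chemistry.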
PacBio SMRT sequencing
zero-mode waveguide (ZMW) http://www.pacificbiosciences.com/products/smrt-technology/
Nanopore sequencing
https://www.youtube.com/watch?v=3UHw22hBpAk
Comparison
NGS DATA AND PROCESSING OVERVIEW
FASTA format
A format for DNA (or protein) sequence
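As a small illustration of the format, a FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text into a dictionary; the function name `parse_fasta` and the record names `seq1` and `seq2` are made up for this example.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record example.
example = """>seq1 hypothetical example record
TAACACCTGGGAAATTCATC
ACAAAAAGATCTTAGCCTAG
>seq2
GGCACATTGTCATTAGG
"""
records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```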
FASTQ format (NGS raw data)
one read: sequence + quality
A format for NGS reads (FASTA + quality)
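To make the "one read" structure concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-lines-per-read layout (header starting with "@", sequence, a "+" separator line, quality string) with no line wrapping; the names `Read` and `parse_fastq` and the example read are hypothetical.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text):
    """Parse four-line FASTQ records into a list of Read tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per read")
    reads = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        reads.append(Read(header[1:], sequence, quality))
    return reads


# Hypothetical read: seven bases, each with one quality character.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
reads = parse_fastq(example)
print(reads[0])
```

Each quality character belongs to the base at the same position, so the two strings always have the same length.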
Mapping back to genome
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Where is this sequence in human genome?
Genome Informatics I (2015 Spring)
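The question of where a read belongs can be sketched with a toy exact-match mapper. This is an illustration only, not how production aligners work: practical tools use compact genome indexes and tolerate mismatches and gaps. The short `reference` string stands in for a genome, and the names `build_index`, `map_read`, and the seed length `k` are assumptions of this example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return 0-based positions where the read matches the reference exactly."""
    hits = []
    # Use the first k bases of the read as a seed, then verify the full read.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Toy reference (a stand-in for the genome) and a read taken from inside it.
reference = "GGCATTAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTG"
read = "CATCACAAAAAGATC"
k = 8
index = build_index(reference, k)
positions = map_read(read, reference, index, k)
print(positions)
```

Looking up a seed in the index avoids scanning the whole reference for every read, which matters when millions of reads must be placed.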
Quality
• Each base call (a call for a nucleotide: 'A', 'T', 'C', 'G') has its own quality
  – quality is the machine's confidence in the call
Phred scale
@SRR1798798.1 D4LHBFN1:204:D1B2UACXX:6:1101:1156:1996 length=101
NCTCTCACCGAGCTCCACGAACGATAAGGGAATCAGTCTTAAAAGAGCCGCGAGTTACAGGCACACCTGAGAGAAAGAGATGTTTGTATTCACCTTAGAAC
+SRR1798798.1 D4LHBFN1:204:D1B2UACXX:6:1101:1156:1996 length=101
#1:BDDDDF?FF@B>:ACFIBCGB3BF@C<?F9?DFBFCFEBFEFIFEIFFFDC>@ABBBB?BBBBBBBB?@:?AA@B@?(:4:>?<AB@:B@@B>>ABBB
Q = -10 log10(e), where e is the probability of the base call being wrong
Probability of the base call being wrong: 10%, 1%, 0.1%, 0.01%, ...
Quality score (Q): 10, 20, 30, 40, ...
Q + 33 → character in the ASCII code table: +, 5, ?, I, ...
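The conversion chain above (error probability → quality score → ASCII character with an offset of 33) can be written out as a short sketch. The function names are made up for this example, and the offset of 33 is taken from the slide.

```python
import math


def error_to_phred(p):
    """Q = -10 * log10(e): convert an error probability to a quality score."""
    return -10 * math.log10(p)


def phred_to_error(q):
    """Invert the Phred formula: error probability for a quality score."""
    return 10 ** (-q / 10)


def phred_to_char(q, offset=33):
    """Encode a quality score as one ASCII character (Q + 33)."""
    return chr(q + offset)


def char_to_phred(c, offset=33):
    """Decode one FASTQ quality character back to a quality score."""
    return ord(c) - offset


# Reproduce the slide's table: 10%, 1%, 0.1%, 0.01% error.
for p in (0.1, 0.01, 0.001, 0.0001):
    q = round(error_to_phred(p))
    print(p, q, phred_to_char(q))

# Decode the first few quality characters of the example read above.
scores = [char_to_phred(c) for c in "#1:BDDDDF"]
print(scores)
```

The first character "#" decodes to a score of 2, which fits the "N" at the start of the example read: the machine had almost no confidence in that call.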
Kim S and Paik S, in preparation
[Figure: standard workflow for NGS data analysis]
A. Data Generation: control and disease samples; sequencing; sequencing raw reads (FASTQ files); quality control; short read alignment (BAM files)
B. Variant Finding: germ-line mutation; somatic mutation; copy number variation (CNV); structural variation (SV)
C. Variant Analysis: recurrence analysis; variant impact prediction; mutation filtration/selection; tumor heterogeneity inference
D. Validation and functional assessment: variant confirmation; pathway analysis; functional study
Also labeled in the figure: xenogeneic sequence
Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows inspection of all genomic regions and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome) at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnostic kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.
DISCUSSION