www.jornadasaludinvestiga.es
Applications of Next Generation Sequencing (NGS) in clinical studies
Javier Pérez Florido
Antonio Rueda Martín
Bioinformaticians
Genomics and Bioinformatics Platform of Andalusia (GBPA)
Agenda
• PART I → Introduction to NGS technologies
→ About GBPA
• PART II → NGS data analysis pipeline: from raw data to candidate variants
• PART III → Full example: from raw data to candidate variants
→ Success stories
Why NGS?
• It works!
• Versatility of the data
• Key: the development of a technology able to massively parallelize the sequencing process drastically reduces sequencing time and costs
History of DNA Sequencing
Nature 458, 719-724 (2009)
Basics of the “new” Technology
→ Get DNA.
→ Fragment the DNA and attach adaptors.
→ Attach it to something (bead or glass).
→ Extend and amplify the signal with some color scheme.
→ Detect fluorochrome by microscopy.
→ Interpret series of spots as short strings of DNA.
→ Simultaneously sequence entire libraries of DNA fragments.
NGS Technologies
Differences among sequencing platforms
• Nanotechnology used.
• Detection system
• Resolution of the image analysis.
• Chemistry and enzymology.
• Read length and number of reads
• Signal to noise detection in the software (Q scores)
• Run time
• Cost
Roche 454 Pyrosequencing
M.L. Metzker, Nature Reviews Genetics (2010)
Roche 454 GS Systems
GS Junior
• Sequencing run: 10 h
• Avg read length: 400 bp
• Reads per run: 100,000
• Output: 40 Mbp
GS FLX+
• Sequencing run: 23 h
• Avg read length: 700 bp
• Reads per run: 1,000,000
• Output: 700 Mbp
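As a quick sanity check on the figures above, the yield of a run is roughly the number of reads times the average read length. The sketch below recomputes both yields from the slide's numbers; the function name `run_yield_mbp` is only illustrative.

```python
def run_yield_mbp(reads_per_run: int, avg_read_length_bp: int) -> float:
    """Approximate run yield in megabase pairs: reads x average read length."""
    return reads_per_run * avg_read_length_bp / 1_000_000


# Figures taken from the slide (the 10 h run and the 23 h run).
print(run_yield_mbp(100_000, 400))    # 10 h run
print(run_yield_mbp(1_000_000, 700))  # 23 h run
```

Both results agree with the 40 Mbp and 700 Mbp quoted on the slide.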
Illumina: sequencing by synthesis
• DNA fragments are ligated at both ends to adapters
• DNA fragments are immobilized at one end on a solid support
• Single-stranded fragments create a “bridge” structure
• Adapters act as primers for PCR amplification
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
• Incorporation of each nucleotide is detected by a CCD camera
• Terminators are removed and synthesis repeated
• Just one nucleotide is incorporated in each cycle
Illumina Sequencers
HiSeq 2500
• Max Output: 1,000 Gb
• Max Read Number: 4,000 M
• Max Read Length: 2x125 bp
MiSeq
• Max Output: 15 Gb
• Max Read Number: 25 M
• Max Read Length: 2x300 bp
NextSeq 500
• Max Output: 120 Gb
• Max Read Number: 400 M
• Max Read Length: 2x150 bp
www.illumina.com
SOLiD 5500
200 Gb/run
35-75 bp fragments
1.8 - 4.8 billion reads/run
2x6 lanes/run
96 bar-codes
ECC: 99.99% accuracy
Colorspace reads
Third Generation NGS: PacBio RS
• SMRT: Single Molecule Real Time DNA synthesis.
• Single Molecule Sequencing: DNA synthesis is detected on a single DNA strand.
– Up to 15,000 nt, 50 bases/second
– DNA polymerase is affixed to the bottom of a tiny hole (~70nm).
– Only the bottom portion of the hole is illuminated allowing for detection of incorporation of dye-labeled nucleotide.
– Real-time Sequencing.
– DNA template is circularized by the use of “bell” shaped adapters.
– As long as the polymerase is stable this allows for continuous sequencing of both strands.
Advantages
• No amplification required.
• Extremely long read lengths.
– Average 2,500 nt. Longest 15,000 nt.
Disadvantages
• High error rates.
– Error rate of ~15% for indels, 1% for substitutions.
Most common applications of NGS
• RNA-seq / Transcriptomics
– Quantitative
– Descriptive: alternative splicing
– miRNA profiling
• Resequencing
– Mutation calling
– Profiling
– Genome annotation
• De novo sequencing
• Copy number variation
• ChIP-seq / Epigenomics
– Protein-DNA interactions
– Active transcription factor binding sites
– Histone methylation
• Metagenomics
• Metatranscriptomics
• Exome sequencing
• Targeted sequencing
DNA sequencing - 1
• Whole GENOME Resequencing
– Need reference genome
– Variation discovery
• Whole GENOME “de novo” sequencing
– Uncharacterized genomes with no reference genome available
– Known genomes where significant structural variation is expected.
– Long reads or mate-pair libraries. Sequencing mostly done by Roche 454, Illumina and PacBio.
– Assembly of reads is needed: computationally intensive
– E.g. bacterial genome sequencing
DNA sequencing - 2
• Targeted Resequencing
– Specific regions in the genome
– Need reference genome
– Need custom probes complementary to the genomic regions
• Nimblegen
• Agilent
• Custom gene panel sequencing
– Covers a large number of genes related to a disease
– Lower cost and quicker than capillary sequencing
– E.g. Disease gene panel
• Whole EXOME Resequencing
– Available for Human and Mouse
– Variation discovery on ORFs
• 2% of the human genome (lower cost)
• 85% of disease mutations are in the exome
Target Enrichment - Exome sequencing
→ DNA (patient)
→ Produce shotgun library
→ Capture exome sequences
→ Wash & sequence
→ Map against reference genome
→ Determine variants, annotate and filter
→ Candidate mutations / genes
Don’t sequence all, just what you need
DNA sequencing - 3
• Amplicon sequencing
– Sequencing of regions amplified by PCR.
– Shorter regions to cover than targeted capture
– No need for custom probes
– Primer design is needed
– High fidelity polymerase
– Multiplexing is needed
Genomics & Bioinformatics Platform of Andalusia (GBPA)
Edificio INSUR, Albert Einstein Street, Cartuja Scientific and Technology Park, Sevilla
• Platform based on Next Generation Sequencing technologies
• Genomics and Bioinformatics labs together
Infrastructure at GBPA
• SOLiD 5500 XL
• Roche 454 GS-FLX+
• High performance cluster
– 24 HPC nodes (72-192 GB)
– Hyperthreading: up to 450 parallel jobs
– Total memory: 2 TB
– Storage: 540 TB
Infrastructure at GBPA
• Recently, GBPA got funding for:
– MiSeq (Illumina)
– HiSeq 2500 (Illumina)
– PacBio RSII (Pacific Biosciences)
Projects at GBPA
Medical Genome Project (MGP)
• A first step towards implementing personalized medicine in the Andalusian Health System
• Characterization of a number of genetic diseases by means of exome sequencing
– Rare genetic diseases
– Monogenic diseases
• Characterization of SNPs in a healthy Spanish control population
– 300 individuals
– More than 500,000 variants found, half of them not previously reported in any public repository
Other projects at GBPA
• Development of an NGS data analysis system for the clinical diagnosis of genetic diseases.
• Currently working on the development of High Performance Computing tools for the analysis of huge sets of variants, in collaboration with the EBI and CIPF.
• Other collaborations: IBIS, Hosp. San Cecilio, Hosp. San Joan de Deu, Hosp. Clinic-IDIBAPS, Hosp. Ramón y Cajal, CIEMAT, UGR, CABIMER, etc.
• Participated in the Sequencing Quality Control project (SEQC/MAQC-III)
– The SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32, pp. 903-914, 2014
Services at GBPA
• RNA-seq / Transcriptomics
– Quantitative
– Descriptive: alternative splicing
– miRNA profiling
• Resequencing
– Mutation calling
– Profiling
– Genome annotation
• De novo sequencing
• Copy number variation
• ChIP-seq / Epigenomics
– Protein-DNA interactions
– Active transcription factor binding sites
– Histone methylation
• Metagenomics
• Metatranscriptomics
• Exome sequencing
• Targeted sequencing
http://www.gbpa.es
Training: 4-day hands-on course for the analysis of genomics / transcriptomics NGS data
NGS data pipeline analysis
DNA sample → (library preparation) → NGS instrument → (sequencing) → Data → (data analysis)
→ Sequence processing (quality control, sequence filtering)
→ Mapping
→ Variant calling
→ Variant annotation
→ Custom filtering
→ Candidate variants
RAW data
NGS instrument → Proprietary format → FastQ
• Different sequencers output different files (sff, csfasta, qual file, xsq, …)
• Nearly all downstream analyses take FastQ as input
Quality control. Data formats: FastQ
• The FastQ format is “a FASTA with qualities”:
1. Header line (like FASTA, but starting with “@”)
2. Sequence (string of nucleotides)
3. “+” and sequence ID (optional)
4. Quality values of the sequence, each encoded as a single-byte ASCII character
• File extension: .fastq
• Sequence quality encoding
o Base quality must be encoded in just 1 byte!
o Each base has a corresponding quality value: the quality in position n corresponds to the base in position n
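The four-line record layout and the one-byte-per-base quality encoding can be sketched with a small reader. This is a minimal illustration, not a production parser: it assumes unwrapped four-line records and the Phred+33 ASCII offset (common, but not stated on the slide), and the names `FastqRecord`, `read_fastq` and `error_probability` are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base, same order as the sequence


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FastQ stream (assumed Phred+33)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character is a single byte: score = ASCII code - offset.
        yield FastqRecord(header[1:], sequence, [ord(ch) - offset for ch in quality])


def error_probability(phred: int) -> float:
    """Phred scale: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


example = io.StringIO("@read1\nGATTACA\n+\nIIII#5?\n")
record = next(read_fastq(example))
print(record.name, record.sequence, record.qualities)
```

The quality in position n of `record.qualities` belongs to the base in position n of `record.sequence`, which is why the reader rejects records where the two lengths differ.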
• Provide a natural framework for incorporating information (allele frequency, linkage disequilibrium, etc.)
SNV calling
• Probabilistic framework: Bayesian approach (used by GATK):
$$p(G \mid D) = \frac{p(G)\, p(D \mid G)}{p(D)}$$
where:
D represents our data (read base pileup at this reference base)
G represents the genotype under consideration
p(G|D) is the posterior probability of genotype G
p(D|G) is the genotype likelihood
p(G) is the prior probability of seeing this genotype (SNP DBs, population sample, etc.)
p(D) is constant over all genotypes (can be ignored)
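To make the Bayesian formula concrete, here is a toy diploid genotype caller. It is a sketch under simplifying assumptions, not GATK's actual model: each read is assumed to come from either allele with probability 1/2, base errors are spread evenly over the three other bases, and the prior (`toy_priors`, with 0.99 on homozygous reference) is a made-up illustration. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

BASES = "ACGT"
GENOTYPES = list(combinations_with_replacement(BASES, 2))  # 10 unordered diploid genotypes


def base_likelihood(observed: str, allele: str, error: float) -> float:
    """p(observed base | true allele), errors spread evenly over the 3 other bases."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_likelihood(pileup, genotype) -> float:
    """p(D|G): product over reads, each read drawn from either allele with prob. 1/2."""
    a1, a2 = genotype
    return prod(
        0.5 * base_likelihood(base, a1, err) + 0.5 * base_likelihood(base, a2, err)
        for base, err in pileup
    )


def genotype_posteriors(pileup, priors):
    """p(G|D) = p(G) p(D|G) / p(D); p(D) is the sum of the numerator over genotypes."""
    unnormalized = {g: priors[g] * genotype_likelihood(pileup, g) for g in priors}
    p_data = sum(unnormalized.values())
    return {g: value / p_data for g, value in unnormalized.items()}


def toy_priors(ref: str, hom_ref_prior: float = 0.99):
    """Hypothetical prior: most mass on homozygous reference, the rest shared equally."""
    other = (1.0 - hom_ref_prior) / (len(GENOTYPES) - 1)
    return {g: hom_ref_prior if g == (ref, ref) else other for g in GENOTYPES}


# Pileup at a site whose reference base is A: five reads show A and five show G,
# each with an assumed base error probability of 0.001.
pileup = [("A", 0.001)] * 5 + [("G", 0.001)] * 5
posteriors = genotype_posteriors(pileup, toy_priors("A"))
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

Because p(D) is the same for every genotype, picking the best genotype only needs the numerator; the normalization is kept here so the values read as probabilities.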
Indel calling
• Small insertions and deletions observed in the alignment of the read relative to the reference genome
– BAM format: an I or D character in the CIGAR string denotes an indel in the read
• Factors to consider when calling indels
– Misalignment of the read
– Alignment score (often cheaper to introduce multiple SNVs than an indel)
– Sufficient flanking sequence on either side of the indel within the read
– Length of the reads
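The I and D operations in a CIGAR string can be located with a few lines of code. The sketch below is illustrative only: the function name `indels_from_cigar` is hypothetical, and the reported position is simply the reference coordinate reached when the indel operation starts, in whatever coordinate system `ref_start` uses.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def indels_from_cigar(cigar: str, ref_start: int):
    """Return (operation, reference_position, length) for each I or D in the CIGAR."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    indels = []
    ref_pos = ref_start
    for length_text, op in CIGAR_TOKEN.findall(cigar):
        length = int(length_text)
        if op in "ID":
            indels.append((op, ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return indels


# 5 matches, a 2-base insertion, 10 matches, a 3-base deletion, 8 matches.
print(indels_from_cigar("5M2I10M3D8M", ref_start=100))
```

Insertions do not advance the reference position, while deletions do, which is why the deletion in the example is reported 10 bases after the insertion rather than 12.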
Variant calling software
Nielsen R, et al., Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 2011; 12: 443-451
Variant Call Format (VCF)
Header
Data
Info fields
FORMAT fields:
• GT: Genotype. For a single ALT:
– 0|0: the sample is homozygous reference
– 0|1: the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
– 1|1: the sample is homozygous alternate
• GQ: Genotype Quality: phred-scaled confidence that the true genotype is the one provided in GT
• DP: Approximate read depth
• HQ: Haplotype Quality
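As an illustration of how the FORMAT column pairs with a sample column, the sketch below splits both on ":" and interprets GT. The sample values are made up for the example, and the helper names `parse_sample` and `zygosity` are hypothetical.

```python
import re


def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair each FORMAT key (e.g. GT:GQ:DP:HQ) with the matching sample value."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def zygosity(gt: str) -> str:
    """Interpret a diploid GT value; '|' (phased) and '/' (unphased) are both accepted."""
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


# Illustrative values only.
sample = parse_sample("GT:GQ:DP:HQ", "0|1:48:8:51,51")
print(sample["GT"], zygosity(sample["GT"]), "GQ =", sample["GQ"], "DP =", sample["DP"])
```

Allele index 0 always refers to REF and 1 to the first ALT, so the same function covers the three cases listed on the slide.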
Variant filtration and labeling
GOAL
• Filter SNVs and indels based on certain criteria, for example using a set of expressions derived from INFO fields: depth, mapping quality, etc.
• Variants that pass such criteria are labeled as PASS. Otherwise, they are labeled as NOT_PASS (or whatever label we want to use)
• Filtering parameters for SNVs and indels are different
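The labeling step can be sketched as a threshold check on INFO fields, with separate parameters for SNVs and indels. The thresholds below are hypothetical values chosen for illustration, not recommended settings. DP (depth) and MQ (mapping quality) are used as example INFO keys, and the function names are assumptions.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string such as 'DP=14;MQ=60' into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value if value else True
    return fields


# Hypothetical minimum values; SNVs and indels get different parameters.
THRESHOLDS = {
    "snv": {"DP": 10, "MQ": 40},
    "indel": {"DP": 15, "MQ": 50},
}


def variant_type(ref: str, alt: str) -> str:
    """Treat single-base REF and ALT as an SNV, anything else as an indel."""
    return "snv" if len(ref) == 1 and len(alt) == 1 else "indel"


def label_variant(ref: str, alt: str, info: str, fail_label: str = "NOT_PASS") -> str:
    """Return PASS when every threshold is met, otherwise the failure label."""
    fields = parse_info(info)
    limits = THRESHOLDS[variant_type(ref, alt)]
    passed = all(float(fields.get(key, 0)) >= minimum for key, minimum in limits.items())
    return "PASS" if passed else fail_label


print(label_variant("A", "G", "DP=14;MQ=60"))   # SNV checked against SNV thresholds
print(label_variant("AT", "A", "DP=14;MQ=60"))  # same INFO, stricter indel thresholds
```

The same INFO values can pass as an SNV and fail as an indel, which is the practical consequence of keeping the two parameter sets separate.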