
www.jornadasaludinvestiga.es

Applications of Next Generation Sequencing (NGS) in clinical studies

Javier Pérez Florido

Antonio Rueda Martín

Bioinformaticians

Genomics and Bioinformatics Platform of Andalusia

(GBPA)

Agenda

• PART I
→ Introduction to NGS technologies
→ About GBPA

• PART II
→ NGS data analysis pipeline: from raw data to candidate variants

• PART III
→ Full example: from raw to candidate variants
→ Success stories


Why NGS?

• It works!

• Versatility of the data

• Key: the development of a technology able to massively parallelize the sequencing process drastically reduces sequencing time and costs

History of DNA Sequencing

Nature 458, 719-724 (2009)

Basics of the “new” Technology

→ Get DNA.

→ Fragment the DNA and attach adaptors.

→ Attach it to something (bead or glass)

→ Extend and amplify signal with some color scheme.

→ Detect fluorochrome by microscopy.

→ Interpret series of spots as short strings of DNA.

→ Simultaneously sequencing entire libraries of DNA sequence fragments.

NGS Technologies

Differences among sequencing platforms

• Nanotechnology used.

• Detection system

• Resolution of the image analysis.

• Chemistry and enzymology.

• Read length and number of reads

• Signal to noise detection in the software (Q scores)

• Run time

• Cost

Roche 454 Pyrosequencing

M.L. Metzker, Nature Reviews Genetics (2010)

Roche 454 GS Systems

GS Junior
• 10 h sequencing
• Avg read length 400 bp
• Reads per run 100,000
• 40 Mbp

GS FLX+
• 23 h sequencing
• Avg read length 700 bp
• Reads per run 1,000,000
• 700 Mbp

Illumina: sequencing by synthesis

• DNA fragments are ligated at both ends to adapters
• DNA fragments are immobilized at one end on a solid support
• Single-stranded fragments create a “bridge” structure
• Adapters act as primers for PCR amplification

From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

Illumina: PCR bridge amplification, reversible terminators

• Four reversible terminator nucleotides, each labelled with a different fluorescent dye

• Incorporation of each nucleotide is detected by a CCD camera

• Terminators are removed and synthesis repeated

• Just one nucleotide is incorporated in each cycle

Illumina Sequencers (www.illumina.com)

Sequencer      Max Output   Max Read Number   Max Read Length
HiSeq 2500     1,000 Gb     4,000 M           2x125 bp
MiSeq          15 Gb        25 M              2x300 bp
NextSeq 500    120 Gb       400 M             2x150 bp

SOLiD 5500

200 Gb/run

35-75 bp fragments

1.8 - 4.8 billion reads/run

2x6 lanes/run

96 bar-codes

ECC: 99.99% accuracy

Colorspace reads

Third Generation NGS: PacBio RS

• SMRT: Single Molecule Real Time DNA synthesis.

• Single Molecule Sequencing: DNA synthesis is detected on a single DNA strand.

– Up to 15,000 nt, 50 bases/second

– DNA polymerase is affixed to the bottom of a tiny hole (~70nm).

– Only the bottom portion of the hole is illuminated allowing for detection of incorporation of dye-labeled nucleotide.

– Real-time Sequencing.

– DNA template is circularized by the use of “bell” shaped adapters.

– As long as the polymerase is stable this allows for continuous sequencing of both strands.

Advantages

• No amplification required.

• Extremely long read lengths.

• Average 2500 nt. Longest 15,000 nt.

Disadvantages

• High error rates.

• Error rate of ~15% for Indels. 1% Substitutions.

Most common applications of NGS

RNA-seq / Transcriptomics
o Quantitative
o Descriptive (alternative splicing)
o miRNA profiling

Resequencing
o Mutation calling
o Profiling
o Genome annotation

De novo sequencing

Copy number variation

ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation

Metagenomics

Metatranscriptomics

Exome sequencing

Targeted sequencing

DNA sequencing - 1

• Whole GENOME Resequencing

– Need reference genome

– Variation discovery

• Whole GENOME “de novo” sequencing

– Uncharacterized genomes with no reference genome available

– Known genomes where significant structural variation is expected.

– Long reads or mate-pair libraries. Sequencing mostly done by Roche 454, Illumina and PacBio.

– Assembly of reads is needed: computationally intensive

– E.g. bacterial genome sequencing

DNA sequencing - 2

• Targeted Resequencing
– Specific regions in the genome
– Need reference genome
– Need custom probes complementary to the genomic regions
  • Nimblegen
  • Agilent

• Custom genes panel sequencing
– Covers a large number of genes related to a disease
– Cheaper and quicker than capillary sequencing
– E.g. Disease gene panel

• Whole EXOME Resequencing
– Available for Human and Mouse
– Variation discovery on ORFs
  • 2% of human genome (lower cost)
  • 85% of disease mutations are in the exome

Target Enrichment - Exome sequencing

DNA (patient) → Produce shotgun library → Capture exome sequences → Wash & Sequence → Map against reference genome → Determine variants, annotate and filter → Candidate mutations / genes

Don’t sequence all, just what you need

DNA sequencing - 3

• Amplicon sequencing

– Sequencing of regions amplified by PCR.

– Shorter regions to cover than targeted capture

– No need of custom probes

– Primer design is needed

– High fidelity polymerase

– Multiplexing is needed

Agenda


→ About GBPA




Genomics & Bioinformatics Platform of Andalusia, GBPA

Edificio INSUR,

Albert Einstein Street.

Cartuja Scientific and Technology Park, Sevilla

• Platform based on Next Generation Sequencing technologies

• Genomics and Bioinformatics labs together


Infrastructure at GBPA

SOLiD 5500 XL

Roche 454 GS-FLX+

High performance cluster
• 24 HPC nodes (72-192 GB)
• Hyperthreading: up to 450 parallel jobs
• Total memory: 2 TB
• Storage: 540 TB

Infrastructure at GBPA

• Recently, GBPA got funding for:
– Illumina MiSeq
– Illumina HiSeq 2500
– Pacific Biosciences PacBio RS II

Projects at GBPA

Medical Genome Project (MGP)
• A first step for the implementation of personalized medicine in the Andalusian Health System
• The characterization of a number of genetic diseases by means of exome sequencing
– Genetic rare diseases
– Monogenic diseases
• To characterize SNPs in a Spanish healthy control population
– 300 individuals
– More than 500,000 variants found. Half of them not previously reported in any public repository.

Other projects at GBPA
• Development of an NGS data analysis system for the clinical diagnosis of genetic diseases.
• Currently working on the development of High Performance Computing tools for the analysis of huge sets of variants, in collaboration with the EBI and CIPF.
• Other collaborations: IBIS, Hosp. San Cecilio, Hosp. San Joan de Deu, Hosp. Clinic-IDIBAPS, Hosp. Ramón y Cajal, CIEMAT, UGR, CABIMER, etc.
• Participated in the Sequencing Quality Control project (SEQC – MAQC III)
– The SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32, pp. 903-914, 2014

Services at GBPA

RNA-seq / Transcriptomics
o Quantitative
o Descriptive (alternative splicing)
o miRNA profiling

Resequencing
o Mutation calling
o Profiling
o Genome annotation

De novo sequencing

Copy number variation

ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation

Metagenomics

Metatranscriptomics

Exome sequencing

Targeted sequencing

http://www.gbpa.es


Training: 4-day hands-on course for the analysis of genomics / transcriptomics NGS data





NGS data analysis pipeline

DNA sample → Library preparation → Sequencing (NGS instrument) → Data → Data analysis

Data analysis steps:
1. Sequence processing (quality control, sequence filtering)
2. Mapping
3. Variant calling
4. Variant annotation
5. Custom filtering
6. Candidate variants

RAW data

NGS instrument → proprietary format → FastQ

• Different sequencers output different files (sff, csfasta, qual file, xsq, …)
• Nearly all downstream analyses take FastQ as the input sequence format

Quality control. Data formats: FastQ

• The FastQ format “is a fasta with qualities”:
1. Header line (like fasta but starting with “@”)
2. Sequence (string of nucleotides)
3. “+” and sequence ID (optional)
4. Quality values of the sequence, each encoded as a single-byte ASCII character

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

• File extension: .fastq

• Sequence quality encoding
o Base quality must be encoded in just 1 byte!
o Each base has a corresponding quality value: the quality in position n is related to the base in position n
o Encoding procedure: error probability → Phred transformation (inverted integer value) → ASCII encoding

• Phred + 33
o Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41]

• Phred + 64
o Illumina 1.3 [0,40], Illumina 1.5 [3,40]
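As a minimal sketch of the decoding step, assuming Phred + 33 encoding and a record that occupies exactly four lines, the quality string of the example record can be turned back into integer Phred scores. The function names are illustrative and do not belong to any particular library.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into (id, sequence, qualities)."""
    header, sequence, separator, quality_string = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, decode_qualities(quality_string)


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
seq_id, sequence, qualities = parse_fastq_record(record)
print(seq_id, len(sequence), qualities[:5])
```

For Phred + 64 files the same function can be called with offset=64; guessing the offset automatically is outside the scope of this sketch.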

Prob. of incorrect base call   Phred quality score   Base call accuracy
1 in 10                        10                    90%
1 in 100                       20                    99%
1 in 1000                      30                    99.9%
1 in 10000                     40                    99.99%
1 in 100000                    50                    99.999%
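The table follows the standard Phred relation Q = -10 · log10(p), where p is the probability of an incorrect base call. A short sketch that reproduces the rows (function names are illustrative):

```python
import math


def phred_to_error_probability(quality):
    """Probability of an incorrect base call for a given Phred score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred score for a given probability of an incorrect base call."""
    return -10 * math.log10(probability)


for quality in (10, 20, 30, 40, 50):
    probability = phred_to_error_probability(quality)
    accuracy = 100 * (1 - probability)
    print(f"Q{quality}: 1 in {round(1 / probability)}, accuracy {accuracy:g}%")
```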

Quality Control

• Evaluation of sequence quality
o Primary tool to assess sequencing
o Evaluating sequences in depth is a valuable approach to assess how reliable our results will be
o QC determines subsequent filtering
o Any filtering decision will affect downstream analysis
o QC must be run after every critical step
o Huge files… don’t worry. Software tools will do it for us.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Quality control: Fastq file → FastQC

Quality control: addressing QC with FastQC

• By means of FastQC, raw reads can be evaluated in terms of different quality metrics:

o Per base sequence quality

o Per sequence quality scores

o Per base sequence content

o Per base GC content

o Per sequence GC content

o Per base N content

o Sequence length distribution

o Duplicate sequences

o Overrepresented sequences

o Overrepresented k-mers

• Examples
o Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
o Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html

Per base sequence quality

Shows an overview of the range of quality values across all bases at each position in the fastq file
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
• Plot regions are labelled good, reasonable and poor quality

• Good data
o Consistent
o High quality along the read

• Bad data
o High variance
o Quality decreases towards the end of the read
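The per-position statistics behind such a box plot (mean, median, quartiles, 10% and 90% points) can be sketched with the Python standard library. This is an illustration under assumptions, not FastQC's own implementation: it assumes Phred + 33 quality strings, and its percentile interpolation may differ slightly from FastQC's.

```python
import statistics


def per_base_quality_summary(quality_strings, offset=33):
    """Summarise Phred scores at each read position across many reads."""
    longest = max(len(quality) for quality in quality_strings)
    summary = []
    for position in range(longest):
        scores = [ord(quality[position]) - offset
                  for quality in quality_strings if len(quality) > position]
        if len(scores) < 2:
            break  # too few reads reach this position to compute quantiles
        deciles = statistics.quantiles(scores, n=10)
        quartiles = statistics.quantiles(scores, n=4)
        summary.append({
            "position": position + 1,
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "lower_quartile": quartiles[0],
            "upper_quartile": quartiles[2],
            "p10": deciles[0],
            "p90": deciles[-1],
        })
    return summary


# Four toy reads; the third one loses quality at its last two bases.
reads = ["IIII", "IIII", "II##", "IIII"]
for row in per_base_quality_summary(reads):
    print(row["position"], row["median"], row["mean"])
```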

Per sequence quality scores

Shows whether a subset of sequences has universally low quality values

• Good data
o Most of the reads are high-quality sequences

• Bad data
o Distribution with bi-modalities: a subset of low quality reads

Sequence length distribution

• Some sequencers output reads of different length (for example, Roche 454)

• Some sequencers generate sequence fragments of uniform length

Sequence Filtering

• It is important to remove bad quality data: this improves our confidence in downstream analysis

• Sequence filtering criteria:
o Mean quality
o Read length
o Read length after trimming
o Percentage of bases above a minimum quality threshold
o Adapter trimming
o Adapter reads
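Several of these criteria can be combined in a few lines. The sketch below trims low-quality bases from the 3' end and then applies length, mean quality and percentage-above-threshold checks. The threshold values are arbitrary assumptions chosen for illustration, and dedicated tools such as those listed in this section implement more refined strategies.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filters(qualities, min_length=30, min_mean_quality=20,
                   quality_threshold=20, min_fraction_above=0.8):
    """Apply read length, mean quality and fraction-above-threshold filters."""
    if len(qualities) < min_length:
        return False
    if sum(qualities) / len(qualities) < min_mean_quality:
        return False
    above = sum(quality >= quality_threshold for quality in qualities)
    return above / len(qualities) >= min_fraction_above


sequence, qualities = trim_low_quality_tail("ACGTACGT", [30, 30, 30, 30, 30, 10, 5, 2])
print(sequence, qualities, passes_filters(qualities, min_length=4))
```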

Sequence Filtering

• Sequence filtering tools

o Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)

o Galaxy (https://main.g2.bx.psu.edu/)

o SeqTK (https://github.com/lh3/seqtk)

o Cutadapt (https://code.google.com/p/cutadapt/)

o Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)

o …



The mapping process

• Reference Genome
o Consensus sequence, built up from high quality sequencing samples
o Control reference sequence against which to compare our samples
o Genome Reference Consortium: created to deliver assemblies: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
o Fasta format
o Different assemblies

NGS data

• DNA-seq, RNA-seq, BS-seq, ChIP-seq, …

• Reads sizes ranging from 75bp – 20kbp

• Single-end and paired-end reads

• Basespace or colorspace

• Challenges
o Massive data:
  • Solid 5500W: 240 Gb in 1x75bp reads, 320 Gb in 2x50bp
  • Illumina HiSeq 2500: 160 Gb in 2x150bp reads
o Natural variability: SNPs, indels, de novo mutations, CNVs…

o Sequencing errors

o RNA-seq: gapped alignment

o Computing resources

Mapping process considerations

• Which aligner should I use?

o Read length

o DNA or RNA

o Basespace or colorspace

o Computing resources

• Aligner parameters

o Single-end or paired-end

o SNVs, Indels

o Read quality

o Should multiple hits be allowed?

Mapping process: algorithms

• Smith-Waterman (SW)
o Aligns any two sequences
o Too slow for NGS and very high memory footprint

• Based on hashes
o Faster than SW
o High memory footprint

• Burrows-Wheeler Transform
o Very fast and low memory footprint
o Very sensitive to errors

• Hybrid approaches
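To make the Burrows-Wheeler idea concrete, the following toy sketch builds the transform of a short reference by sorting its rotations and counts exact occurrences of a pattern with backward search. It is for illustration only: real aligners use compressed indexes and tolerate mismatches and gaps, which this sketch does not attempt.

```python
def burrows_wheeler_transform(text, terminator="$"):
    """Return the BWT of text from its sorted rotations (illustration only)."""
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt, pattern):
    """Count exact matches of pattern using backward search over the BWT."""
    first_column = sorted(bwt)
    first_row = {}  # row of the first sorted rotation starting with each character
    for row, char in enumerate(first_column):
        first_row.setdefault(char, row)

    def rank(char, upto):
        # Number of occurrences of char in bwt[:upto] (linear scan for clarity).
        return bwt[:upto].count(char)

    low, high = 0, len(bwt)
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + rank(char, low)
        high = first_row[char] + rank(char, high)
        if low >= high:
            return 0
    return high - low


bwt = burrows_wheeler_transform("GATTACA")
print(bwt, count_occurrences(bwt, "TA"), count_occurrences(bwt, "A"))
```

The search only ever touches the transformed string and a small table, which is the reason BWT-based indexes can stay compact; the strictness of exact matching also hints at why the approach is sensitive to errors.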

Mapping process: tools

• BWA, BWA-SW and BWA-MEM
o Burrows-Wheeler Aligner, based on the Burrows-Wheeler Transform
o http://bio-bwa.sourceforge.net/
o Widely used: supports many read lengths, valid for Illumina, 454, etc.

• Bowtie and Bowtie2
o Based on the Burrows-Wheeler Transform (BWT) algorithm
o Bowtie allows a few mismatches and no gaps (Bowtie2 supports gapped alignment); claimed to be the fastest
o http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

• HPG Aligner
o Great speed and sensitivity
o DNA or RNA
o HPC technologies used to provide the fastest runtime: multicore, SSE, GPUs
o Valid for short and long reads
o http://wiki.opencb.org/projects/hpg/doku.php?id=aligner:overview

• Other tools
o DNA: Bfast, Shrimp & Shrimp2, Blat, Mosaik-aligner, NextGenMap
o RNA-Seq: TopHat (uses bowtie) & TopHat2 (uses bowtie2), SOAPSplice
o BS-seq: Bismark (uses bowtie2), BRAT & BRAT-BW

Mapping output: SAM / BAM format

• http://samtools.github.io/hts-specs/SAMv1.pdf
• A SAM file has two sections: header and alignment

• BAM format
o BAM format is the binary (compressed) representation of a SAM file
o A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same

• BAM index
o Indexing a BAM file allows access to the alignments overlapping a specified region without going through all the alignments
o BAM index file: .bai
o The BAM file must be sorted by coordinate to be indexed

• Tools
o Samtools, http://samtools.sourceforge.net
o Provides several utilities for manipulating alignments: SAM to BAM, sorting, BAM index, etc.
o Others: Picard, Pysam, Bio-Sam tools

Mapping procedure

Choose a valid reference → Choose aligner → Choose aligner params → SAM to BAM → Sort BAM → Index BAM → QC
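One way to picture the procedure is as a list of command lines, here for a paired-end run with BWA-MEM and samtools. The sketch only builds the commands and prints them; it does not execute anything. All file names are hypothetical, and default aligner parameters are assumed, whereas a real run would tune them to the data.

```python
def mapping_commands(reference, reads_1, reads_2, prefix):
    """Build command lines for a paired-end BWA-MEM + samtools run.

    Each entry is (argv, stdout_path); stdout_path is None when the tool
    writes its own output files instead of printing to standard output.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        (["bwa", "index", reference], None),                  # index the reference
        (["bwa", "mem", reference, reads_1, reads_2], sam),   # align, SAM on stdout
        (["samtools", "view", "-b", "-o", bam, sam], None),   # SAM to BAM
        (["samtools", "sort", "-o", sorted_bam, bam], None),  # sort by coordinate
        (["samtools", "index", sorted_bam], None),            # create the .bai index
    ]


for argv, stdout_path in mapping_commands("ref.fa", "s_1.fastq", "s_2.fastq", "sample"):
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```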


Mapping procedure: can the results be visualized?

• Yes! Use IGV: The Integrative Genomics Viewer

• It is an integrated visualization tool of large data types

• http://www.broadinstitute.org/igv/


• IGV supports multiple file formats (not only BAM!)


RefSeq track

Zoom in to focus on an exon

Visualizing variants



What is variant calling?

Variant calling identifies variable sites (i.e. sites different to the reference)



• Variant types
o SNV: single nucleotide variant
o Indel: small insertion/deletion variant

• NGS data can suffer from high error rates
o Base-calling, alignment errors

• Accurate variant calling is difficult
o There is often considerable uncertainty associated with the results.

• It is crucial to quantify and account for this uncertainty
o It influences downstream analyses based on the inferred SNVs (identification of rare mutations, estimation of allele frequencies, etc.)

Variant calling pipeline based on GATK

Mapping
→ PRE-processing: Mark duplicates → Indel realignment → Base quality recalibration
→ Variant calling
→ POST-processing: Filtering and labeling → Variant annotation

M. DePristo et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43:491-498, 2011


SNV calling

Indel calling

SNV calling

• SNV: single nucleotide variant
o Examine the bases aligned to each position and look for differences

• Two steps:
o Variant calling: determines at which positions at least one of the bases differs from the reference
o Genotype calling: determines the genotype of each individual at positions where a variant has already been called

• Early methods:
o Counting alleles at each site and using simple cutoff rules for when to call a SNV or genotype

• Probabilistic frameworks:
o Compute genotype likelihoods
o Advantages:
  – Provide statistical measures of uncertainty
  – Lead to higher accuracy of genotype calling
  – Provide a natural framework for incorporating information (allele frequency, linkage disequilibrium, etc.)

SNV calling

• Probabilistic framework: Bayesian approach (used by GATK):

$$p(G \mid D) = \frac{p(G)\, p(D \mid G)}{p(D)}$$

where:
D represents our data (read base pileup at this reference base)
G represents the genotype under consideration
p(G|D) is the posterior probability of genotype G
p(D|G) is the genotype likelihood
p(G) is the prior probability of seeing this genotype (SNP DBs, population sample, etc.)
p(D) is constant over all genotypes (can be ignored)
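A toy version of this calculation for a diploid site with one reference and one alternate allele is sketched below. The error model is an assumption made for illustration (each read base is independent, and a miscalled base is equally likely to be any of the three other nucleotides); it is not the model implemented in GATK. Dividing by the sum over genotypes plays the role of p(D).

```python
def genotype_posteriors(pileup, ref, alt, priors):
    """Posterior probability of each diploid genotype given a base pileup.

    pileup: list of (base, phred_quality) pairs observed at one position.
    priors: dict with prior probabilities for 'ref/ref', 'ref/alt', 'alt/alt'.
    """
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    unnormalised = {}
    for name, alleles in genotypes.items():
        likelihood = 1.0
        for base, quality in pileup:
            error = 10 ** (-quality / 10)
            # Probability of seeing this base if the read came from each allele.
            per_allele = [(1 - error) if base == allele else error / 3
                          for allele in alleles]
            likelihood *= 0.5 * per_allele[0] + 0.5 * per_allele[1]
        unnormalised[name] = priors[name] * likelihood
    total = sum(unnormalised.values())  # p(D), identical for every genotype
    return {name: value / total for name, value in unnormalised.items()}


flat_priors = {"ref/ref": 1 / 3, "ref/alt": 1 / 3, "alt/alt": 1 / 3}
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
posteriors = genotype_posteriors(pileup, ref="A", alt="G", priors=flat_priors)
print(max(posteriors, key=posteriors.get))
```

With long pileups the product of likelihoods underflows, so practical implementations work with log-likelihoods instead.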

Indel calling

• Small insertions and deletions observed in the alignment of the read relative to the reference genome
o BAM format: an I or D character in the CIGAR string denotes an indel in the read

• Factors to consider when calling indels
o Misalignment of the read
o Alignment score (often cheaper to introduce multiple SNVs than an indel)
o Sufficient flanking sequence either side of the read
o Length of the reads
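The I and D operations can be located by walking along the CIGAR string while tracking the reference coordinate. In this sketch the function name and the returned tuple layout are illustrative choices; the rule for which operations consume reference bases (M, D, N, = and X) follows the SAM specification.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, alignment_start):
    """Return (type, reference_position, length) for each I or D operation.

    alignment_start is the 1-based leftmost mapping position (POS in SAM).
    A deletion is reported at its first deleted reference base; an insertion
    is reported at the reference base that follows the inserted sequence.
    """
    indels = []
    reference_position = alignment_start
    for length_text, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length_text)
        if operation == "I":
            indels.append(("insertion", reference_position, length))
        elif operation == "D":
            indels.append(("deletion", reference_position, length))
        if operation in "MDN=X":  # operations that consume reference bases
            reference_position += length
    return indels


print(indels_from_cigar("10M2I5M3D8M", alignment_start=100))
```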

Variant calling software

Nielsen R, et al., Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 2011; 12: 443-451

Variant Call Format (VCF)

Header

Data

Info fields

FORMAT fields:

GT: Genotype. For a single ALT:

0|0 – the sample is homozygous reference

0|1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles

1|1 – the sample is homozygous alternate

GQ: Genotype Quality: phred-scaled confidence that the true genotype is the one provided in GT

DP: Approximate read depth

HQ: Haplotype Quality
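A sample column of a VCF record is read by pairing it with the FORMAT column, since both are colon-separated lists in the same order. The sketch below does that and classifies the GT value; in VCF the separator `|` marks a phased genotype and `/` an unphased one. The function names are illustrative, and the example values are made up.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. 'GT:GQ:DP:HQ') with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def zygosity(genotype):
    """Classify a diploid GT value such as '0|1' or '1/1'."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if all(allele == "0" for allele in alleles):
        return "homozygous reference"
    if len(set(alleles)) == 1:
        return "homozygous alternate"
    return "heterozygous"


sample = parse_sample("GT:GQ:DP:HQ", "0|1:48:8:51,51")
print(sample["GT"], zygosity(sample["GT"]), int(sample["DP"]))
```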

Variant filtration and labeling

GOAL

• Filter SNVs and indels based on certain criteria, for example, using a set of expressions derived from INFO fields: depth, mapping quality, etc.

• Variants that pass such criteria are labeled as PASS. Otherwise, they are labeled as NOT_PASS (or whatever label we want to use)

• Filtering parameters for SNVs and indels are different

http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set

SNVs

QD < 2.0

MQ < 40.0

FS > 60.0

HaplotypeScore > 13.0

MQRankSum < -12.5

ReadPosRankSum < -8.0

Indels

QD < 2.0

FS > 200.0

ReadPosRankSum < -20.0
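The thresholds above can be read as failure conditions: a variant that meets any one of them is not labeled PASS. The sketch below applies them to a dictionary of INFO annotations. The data layout and function name are assumptions for illustration (this is not GATK's interface), and a rule is skipped when its annotation is absent from the record.

```python
SNV_FAIL_RULES = [
    ("QD", "<", 2.0),
    ("MQ", "<", 40.0),
    ("FS", ">", 60.0),
    ("HaplotypeScore", ">", 13.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]

INDEL_FAIL_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 200.0),
    ("ReadPosRankSum", "<", -20.0),
]


def label_variant(info, rules, fail_label="NOT_PASS"):
    """Return 'PASS' unless some annotation in info meets a failure rule."""
    for key, comparison, threshold in rules:
        if key not in info:
            continue  # annotation absent: this rule cannot be evaluated
        value = info[key]
        if comparison == "<" and value < threshold:
            return fail_label
        if comparison == ">" and value > threshold:
            return fail_label
    return "PASS"


print(label_variant({"QD": 12.3, "MQ": 60.0, "FS": 1.2}, SNV_FAIL_RULES))
print(label_variant({"QD": 5.0, "FS": 100.0}, SNV_FAIL_RULES))
print(label_variant({"QD": 5.0, "FS": 100.0}, INDEL_FAIL_RULES))
```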




Why annotation?

• Each healthy person carries in the exome:
o Approx. 11,000 synonymous variants
o Approx. 11,000 non-synonymous variants
o From 250 to 300 loss-of-function variants in annotated genes
o From 50 to 100 variants previously implicated in inherited disorders

• Freeman-Sheldon syndrome
o Only two variants are the true disease causal mutations

Why annotation?

GOAL: To identify a small subset of functionally important variants from large amounts of sequencing data to pinpoint potential disease causal genes and causal mutations

P. Cordero et al., Whole-genome sequencing in personalized therapeutics, Clinical Pharmacology & Therapeutics, vol. 91(6): 1001-1009, 2012

G.M. Cooper et al., Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data, Nature Reviews Genetics, vol. 12: 628-640, 2011

K. Wang et al., ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Research, vol. 38(16), e164, 2010

Annotation levels

• Genomic localization

Some DBs of functional information

Annotation software

• ANNOVAR
o http://www.openbioinformatics.org/annovar
o Local annotation of SNVs and indels
o DBs: dbSNP, 1000g, regulatory information…
o Prediction: SIFT, Polyphen, Mutation Taster
o Species: human, mouse, worm, fly, yeast

• Variant Effect Predictor (VEP)
o http://www.ensembl.org/info/docs/tools/vep/index.html
o Can annotate SNVs, indels and complex variants
o Prediction: SIFT, Polyphen
o Many species

• HPG variant
o http://docs.bioinfo.cipf.es/projects/hpg-variant
o Can annotate SNVs and indels
o Huge amount of DBs available: HGMD, 1000g, dbSNP, regulatory information…
o 11 species (human, mouse, worm, fly, yeast,…)



Custom filtering

• The set of annotated variants can be huge.

• Filtering is needed, for example, based on:

• Genotype according to pedigree

• Variant type: synonymous, nonsynonymous, stoploss, stopgain…

• Population frequency

• Conservation

• Disease information

• Pathway or ontologies

• …

It depends on the hypothesis!!!
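Two of these criteria, variant type and population frequency, are easy to express as a filter over annotated variants. In the sketch below the dictionary keys ("effect", "population_frequency"), the default effect list and the 1% frequency cutoff are hypothetical choices; the right values depend on the hypothesis being tested.

```python
def custom_filter(variants,
                  effects=("nonsynonymous", "stopgain", "stoploss"),
                  max_frequency=0.01):
    """Keep variants with a selected effect that are rare in the population.

    A variant with no known population frequency is treated as novel and kept.
    """
    kept = []
    for variant in variants:
        frequency = variant.get("population_frequency")
        is_rare = frequency is None or frequency <= max_frequency
        if variant["effect"] in effects and is_rare:
            kept.append(variant)
    return kept


annotated = [
    {"id": "var1", "effect": "synonymous", "population_frequency": 0.001},
    {"id": "var2", "effect": "nonsynonymous", "population_frequency": 0.20},
    {"id": "var3", "effect": "nonsynonymous", "population_frequency": None},
    {"id": "var4", "effect": "stopgain", "population_frequency": 0.002},
]
print([variant["id"] for variant in custom_filter(annotated)])
```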







From raw to candidate variants

1. Sequence processing (FASTQ)
2. Mapping (BAM)
3. Variant calling (VCF)
4. Variant annotation (VCF)
5. Custom filtering (VCF)
6. Candidate variants (XLS, PDF, HTML…)

FAMILY-21 STRATEGY

• DNA capture: custom capture to target chromosome 21 exons
• Multiplexed Illumina NextSeq sequencing
• Read quality control
• Read mapping using BWA
• Variant calling using GATK
• Variant quality filtration using GATK

• Phenotype: Muscular Dystrophy
• Monogenic disease
• Inheritance: autosomal recessive
• Linked to chromosome 21
• Consanguinity

[Pedigree figure; labels: 199, 200, 201, 249]
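For an autosomal recessive disease in a consanguineous family, a simple pedigree filter keeps variants that are homozygous alternate in affected individuals, heterozygous in obligate carriers (typically the parents) and not homozygous alternate in other unaffected relatives. The sketch below assumes a single ALT allele per site. The sample names are hypothetical and are not the sample identifiers of this family, whose affection status is not stated in the slides.

```python
def alt_allele_count(genotype):
    """Count ALT alleles in a GT value such as '0/1' or '1|1'; None if missing."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)


def fits_autosomal_recessive(genotypes, affected, carriers, other_unaffected=()):
    """Check one variant's genotypes against a recessive inheritance model."""
    counts = {sample: alt_allele_count(gt) for sample, gt in genotypes.items()}
    return (
        all(counts[sample] == 2 for sample in affected)
        and all(counts[sample] == 1 for sample in carriers)
        and all(counts[sample] is not None and counts[sample] < 2
                for sample in other_unaffected)
    )


genotypes = {"father": "0/1", "mother": "0/1", "child": "1/1", "sibling": "0/0"}
print(fits_autosomal_recessive(genotypes, affected=["child"],
                               carriers=["father", "mother"],
                               other_unaffected=["sibling"]))
```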


Raw variants: 1078 variants
→ Annovar annotation
→ Filter by variant type (non-synonymous variants): 155 variants
→ Filter by pedigree: 4 variants
→ Full annotation: 1 candidate
→ Experimental validation





Success stories

• Whole Genome Example

• Whole Exome Example

• Targeted Sequencing Example

Thank you for your attention.
