Dealing with ‘raw reads’
Analysis of Next-Generation Sequencing Data
Friederike Dündar
Applied Bioinformatics Core (ABC, WCM)
Slides at https://bit.ly/2T3sjRg (https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/)
January 28, 2020
1 Fluorescence-based microscopy
2 Single and paired-end reads
3 Illumina’s “raw reads”
4 Quality control of sequencing reads
5 Sequence Read Archive
6 References
Fluorescence-based microscopy
Re-cap: Sequencing by synthesis after library preparation
The number of sequencing cycles determines the read length.
Fluorophores: molecules that re-emit light upon absorption of light.
Fluorescence microscopes separate emitted light (dim) from excitation light (bright).
See Sanderson et al. [2014] for an overview of fluorescence microscopy techniques (not just DNA-sequencing-related).
Single and paired-end reads
Types of reads
Single reads are cheaper. (Why?)
Paired-end (PE) reads are helpful for:
- alignment along repetitive regions
- chromosomal rearrangements and gene fusion detection
- de novo genome and transcriptome assembly
- precise information about the size of the original fragment (insert size)
- PCR duplicate identification
Paired-end read generation
Illumina’s “raw reads”
Illumina’s read output: turning images into text files
TIFF images → BCL files → FASTQ files
BCL files
- base call files (binary)
- during sequencing, base calls for every location of the flow cell are added live for every cycle
FASTQ files
- base calls are gathered per read rather than per cycle
- reads are sorted into different files per sample as identified by the barcodes (demultiplexing)
All steps here are performed by Illumina’s proprietary CASAVA software.
The file name usually includes some information about the sample:
<sample name>_<barcode sequence>_<L(lane)>_<R(read number)>_<set number>.fastq.gz
e.g. MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz
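The naming pattern can be taken apart with a regular expression. This is a minimal sketch, assuming the file name follows the pattern exactly and the barcode consists only of A/C/G/T/N; the function name `parse_fastq_name` and the field names are made up for illustration.

```python
import re

# <sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)"
    r"_L(?P<lane>\d+)_R(?P<read>\d+)_(?P<set>\d+)\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Return the file name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Lane, read number and set number are zero-padded integers in the file name.
    for key in ("lane", "read", "set"):
        fields[key] = int(fields[key])
    return fields

info = parse_fastq_name("MyExperiment_AGCTTGTTC_L001_R1_001.fastq.gz")
print(info)
```

File names that were renamed after demultiplexing, or that use a different barcode notation, will simply return `None` here.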
The FASTQ format: FASTA + quality score
1 read = 4 lines
1 @Read ID and sequencing run information
2 sequence
3 + (additional description possible; usually left empty)
4 quality scores
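The four-line layout makes FASTQ easy to read record by record. The sketch below assumes strictly four lines per read (no wrapped sequences); `read_fastq` is a hypothetical helper, not part of any library.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        # Take the next four lines; fewer than four means end of file or truncation.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            break
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality

# Two toy reads held in memory instead of a file.
example = io.StringIO(
    "@read1\nACGTN\n+\nIIII#\n"
    "@read2\nGGCA\n+\nFFF!\n"
)
records = list(read_fastq(example))
print(records)
```

For compressed files, the same function can be given a handle from `gzip.open(path, "rt")`.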
The read ID line is standardized by Casava 1.8
CAUTION: This will only be true if you receive FASTQ files fresh off the sequencer. If you download FASTQ files from public repositories, the read ID might have been changed significantly.
see https://en.wikipedia.org/wiki/FASTQ_format
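A sketch of how such a read ID could be split into its fields. The field layout used here is the one described for Casava 1.8 on the Wikipedia page cited above (instrument, run, flow cell, lane, tile, x, y, then read number, filter flag, control number, index); the class and function names are made up for illustration.

```python
from typing import NamedTuple

class ReadID(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    index_sequence: str

def parse_read_id(line):
    """Split a Casava 1.8-style read ID line into its components."""
    line = line.strip().lstrip("@")
    # The two halves of the ID are separated by a single space.
    location, description = line.split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return ReadID(instrument, int(run), flowcell, int(lane), int(tile),
                  int(x), int(y), int(read), filtered == "Y", int(control), index)

rid = parse_read_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(rid)
```

Read IDs from public repositories will often fail to unpack here, which is exactly the situation the caution above warns about.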
The quality scores: summarizing numerical scores into single-character representations
Illumina’s CASAVA pipeline:
- BCL files: base calls (A/C/T/G) are immediately recorded with an error probability (see the QC section for reasons for base call uncertainties).
- FASTQ files: the error probabilities are translated into ASCII symbols.
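The translation from error probability to symbol can be sketched in two steps. The slide does not spell out the formula; the code assumes the standard Phred definition Q = -10 * log10(p) and an ASCII offset of 33. The function names are made up for illustration.

```python
import math

def phred_score(error_probability):
    """Phred quality score Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(error_probability))

def encode_quality(score, offset=33):
    """Turn a numerical score into its single-character representation."""
    return chr(score + offset)

# An error probability of 1 in 1000 corresponds to Q30.
for p in (0.1, 0.01, 0.001, 0.0001):
    q = phred_score(p)
    print(p, q, encode_quality(q))
```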
ASCII symbols
www.ascii-code.com
ASCII encodes 128 specified characters into seven-bit integers, which is useful for digital communication.
The first 33 characters (codes 0 to 32) are unprintable control codes (e.g. “Start of Text”) or whitespace, therefore the Phred scores were originally encoded using an offset of +33, so that a score of 0 becomes “!”.
Printable ASCII symbols start at 33
Different offsets have been used by different Casava versions
Both the range of the base call score and its translation via the ASCII code (offset) are somewhat arbitrary and have undergone numerous changes.
Today’s standard:
min. score: 0
max. score: 41
ASCII offset: 33
Make sure you know which version you’re dealing with.
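One way to check which version you are dealing with is to look at the range of symbols in the quality lines. The sketch below is a heuristic under stated assumptions, not a definitive test: symbols below ASCII 59 can only occur with offset 33, and symbols above “J” (score 41 at offset 33) suggest the older offset of 64. Anything in between is reported as ambiguous. The function names are made up for illustration.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string back into numerical scores."""
    return [ord(char) - offset for char in quality_string]

def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed symbols; None if ambiguous."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # such low symbols cannot occur with an offset of 64
    if max(codes) > 74:
        return 64  # would mean a score above 41 with an offset of 33
    return None

print(decode_quality("!5?I"))
print(guess_offset(["IIII#"]), guess_offset(["hhhh^"]), guess_offset(["IIII"]))
```

The more reads are inspected, the less likely the ambiguous case becomes, because low-quality bases at the read ends tend to reveal the lower bound.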
Quality control of sequencing reads
Two basic QC questions
1 Did our library prep generate a faithful representation of the DNA/RNA molecules in our samples?
- ideally, the entire universe of nucleotides was captured (diverse library)
- no contaminations
- no degradation
- no bias towards fragments of certain GC contents and/or sizes
2 How successful was the actual sequencing?
- consistently high base call confidence
- uniform nucleotide frequencies
Biases: QC should help identify systematic distortions of data and their possible sources.
FastQC
- unpublished, but most widely used QC tool
- supports all NGS technologies
- continuously developed and maintained by long-time bioinformatics experts
- will only use the first 200K reads for the diagnosis!
Sequencing quality
Based on ASCII-encoded Phred scores within the FASTQ file.
Sequencing quality: reasons for sequencing noise
Noise = fluorophore intensity signal is not as strong and clear as expected.
- laser not well calibrated
- interfering signals from neighbouring clusters or bases with similar emission spectra
- unsynchronized fragments in each cluster:
  - phasing: a small fraction of fragments in each cluster fails to incorporate any base
  - prephasing: more than one base is incorporated
- decaying chemicals (runs often last several days to a week!)
- extraneous objects on the flow cell (e.g. dust, air bubbles)
Physically localized error rates: tiles vs. time
Contaminations: threats to the full representation of our original fragment pool (and waste of seq. reads)
Sources:
- primer contamination
- adapter contamination
  - sequence read length larger than the fragment size (3’ contamination)
  - adapter dimers without insert
- DNA from other species/libraries
Consequences:
- noise
- reduced alignment rates
Can be identified by examining sequence composition and overrepresented sequences/k-mers.
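Looking for overrepresented sequences and k-mers boils down to counting. The sketch below uses toy reads; the threshold of 50 % and the function names are arbitrary choices for illustration, and real QC tools apply their own cutoffs and sampling.

```python
from collections import Counter

def overrepresented(sequences, min_fraction):
    """Return (sequence, fraction) pairs for sequences making up at least min_fraction of reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [(seq, n / total) for seq, n in counts.most_common() if n / total >= min_fraction]

def kmer_counts(sequences, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

reads = ["ACGTACGT", "AGATCGGAAGAGC", "AGATCGGAAGAGC", "TTGACCA"]
frequent = overrepresented(reads, min_fraction=0.5)
kmers = kmer_counts(reads, k=4)
print(frequent)
print(kmers.most_common(3))
```

A sequence or k-mer that dominates the counts is a candidate for a primer, adapter, or other contaminant and can then be compared against known adapter sequences.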
Detecting contaminations
Per Base Sequence Content
If the fragments are a random and diverse representation of the entire genome, there should be a uniform distribution of all four bases across all cycles.
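The per-base sequence content is the fraction of each base at every read position (cycle). This is a minimal sketch with toy reads, assuming the reads are held in a list; `per_base_content` is a made-up name, not a FastQC function.

```python
from collections import Counter

def per_base_content(sequences):
    """Return one dict per cycle, mapping each base to its fraction at that position."""
    longest = max((len(seq) for seq in sequences), default=0)
    content = []
    for cycle in range(longest):
        # Only reads that are long enough contribute to this cycle.
        counts = Counter(seq[cycle] for seq in sequences if len(seq) > cycle)
        total = sum(counts.values())
        content.append({base: counts[base] / total for base in "ACGTN"})
    return content

reads = ["ACGT", "AGGT", "ATGA", "ACCA"]
content = per_base_content(reads)
for cycle, fractions in enumerate(content, start=1):
    print(cycle, fractions)
```

In this toy example the first cycle is 100 % A, the kind of strong deviation that would show up as a spike in the plot.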
Per Base Sequence Content – more examples
- irregularities in the first ca. 8 bp are often seen for RNA-seq and ATAC-seq and indicate a bias for certain sequences at the fragment beginning
- more severe deviations from uniformity often indicate contaminations and/or lack of library diversity
Trimming contaminations & low-quality bases
- Mostly done to improve alignment.
- Can be done before alignment or, if contaminations/low-quality bases are low in number, might be left to the “soft-clipping” function of read aligners (i.e. ignoring mismatched bases at the beginning/end of a read).
- There are numerous tools out there to do the job, e.g. Cutadapt [Martin, 2011] and TrimGalore.
- For de novo assemblies, it is probably more meaningful to perform some error correction based on overlapping reads rather than trimming the reads [Salzberg et al., 2012, Yang et al., 2013].
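To illustrate what trimming does, here is a deliberately simplified sketch. It is not how Cutadapt or TrimGalore work internally: it only handles exact adapter matches at the 3’ end and a plain score cutoff, and the function names, the cutoff of 20, and the minimum overlap of 3 are assumptions for illustration.

```python
def trim_adapter_3prime(sequence, quality, adapter, min_overlap=3):
    """Cut the read where the adapter, or an adapter prefix at the read end, starts."""
    position = sequence.find(adapter)  # full adapter somewhere inside the read
    if position == -1:
        # Otherwise look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                position = len(sequence) - overlap
                break
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]

def trim_low_quality_3prime(sequence, quality, min_score=20, offset=33):
    """Remove bases from the 3' end until a base with a score >= min_score is reached."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]

seq, qual = "ACGTACGTAGATCGG", "IIIIII##IIIIIII"
seq, qual = trim_adapter_3prime(seq, qual, adapter="AGATCGGAAGAGC")
after_adapter = (seq, qual)
after_quality = trim_low_quality_3prime(seq, qual)
print(after_adapter, after_quality)
```

Dedicated tools additionally tolerate mismatches in the adapter and use more careful quality-trimming rules, which is why they should be preferred for real data.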
Duplicate reads
- optical duplicates (same DNA cluster erroneously reported as separate clusters)
- natural duplicates (multiple independent original fragments with very similar sequence)
  - more likely to occur for small(ish) genomes/transcriptomes and experiments that enrich for relatively few and small regions of the genome
- PCR duplicates (1 original fragment)
  - often sample-specific and very difficult to correct in silico
  - can be reduced by avoiding excessive PCR
The Problem: There is no way to distinguish natural from PCR duplicates!
Duplicate reads: FastQC assessment
Proportion of reads (y-axis) that contain sequences in each of the different duplication level bins (x-axis).
Blue line: all reads (= first 100K!): how many times are individual sequences found?
Red line: sequences after de-duplication: how many different sequences were found to be duplicated?
Check that the red line is flat and that the number of remaining reads after de-duplication is acceptable.
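The two lines of the duplication plot can be approximated by counting how often each sequence occurs. This sketch is not FastQC’s implementation (FastQC additionally groups the levels into bins and works on a subset of reads); `duplication_levels` is a made-up name.

```python
from collections import Counter

def duplication_levels(sequences):
    """Return two dicts keyed by duplication level (number of copies of a sequence):
    the fraction of all reads at that level, and the fraction of distinct sequences."""
    copies = Counter(sequences)               # sequence -> number of occurrences
    level_counts = Counter(copies.values())   # level -> number of distinct sequences
    total_reads = sum(copies.values())
    total_distinct = len(copies)
    of_total = {level: level * n / total_reads for level, n in level_counts.items()}
    of_distinct = {level: n / total_distinct for level, n in level_counts.items()}
    return of_total, of_distinct

reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCA", "GGCA"]
of_total, of_distinct = duplication_levels(reads)
remaining = len(set(reads)) / len(reads)  # share of reads left after de-duplication
print(of_total, of_distinct, remaining)
```

Here `of_total` plays the role of the blue line and `of_distinct` that of the red line; in the toy data half of all reads belong to a sequence seen three times, and 50 % of the reads would remain after de-duplication.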
Two basic QC questions
1 Did our library prep generate a faithful representation of the DNA/RNA molecules in our samples?
- ideally, the entire universe of nucleotides was captured (diverse library)
- no contaminations
- no bias towards fragments of certain GC contents and/or sizes
- no degradation
2 How successful was the actual sequencing?
- consistently high base call confidence
- uniform nucleotide frequencies
QC summary
Figure from Zhou and Rokas [2014] (highly recommended reading!)
Sequence Read Archive
Where are all the reads?
SRA = main repository for publicly available DNA and RNA sequencing data, of which three instances are maintained world-wide.
GEO (https://www.ncbi.nlm.nih.gov/geo/) can be used to find SRA data, too.
See O’Sullivan et al. [2017] for many more details.
References
Christopher O’Sullivan, Benjamin Busby, and Ilene Karsch Mizrachi. Managing Sequence Data. In Jonathan M. Keith, editor, Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution. Humana Press, 2017. doi: 10.1007/978-1-4939-6622-6_4.
Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James A. Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 2012. doi: 10.1101/gr.131383.111.
Michael J. Sanderson, Ian Smith, Ian Parker, and Martin D. Bootman. Fluorescence microscopy. Cold Spring Harbor Protocols, 2014. doi: 10.1101/pdb.top071795.
Xiao Yang, Sriram P. Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics, 2013. doi: 10.1093/bib/bbs015.
Xiaofan Zhou and Antonis Rokas. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Molecular Ecology, 23(7):1679–1700, 2014. doi: 10.1111/mec.12680.