De-multiplexing & Quality Control Challenges and Solutions Sridhar Srinivasan Bioinformatician Premas Lifescience

De-multiplexing & Quality Control Challenges and Solutions · –Translate base calls (.bcl files) to compressed, demultiplexed FASTQ files –Align reads –Call variants (SNPs and

Aug 07, 2020


dariahiddleston
Page 1

De-multiplexing & Quality Control Challenges and Solutions

Sridhar Srinivasan

Bioinformatician

Premas Lifescience

Page 2

Abstract

• Illumina has multiple sequencing platforms that produce large amounts of high-quality sequence data in a short time frame.

• To use the full potential of a sequencing run, we generally multiplex many samples into one run.

• This has to be followed by demultiplexing and quality control of the data to get reliable and reproducible results.

Page 3

Our Discussion Here Is About

• The tools and strategies for obtaining good demultiplexed data.

• How to check various quality parameters of sequencing data, and methods to remove low-quality or contaminated reads.

Page 4

Data analysis

[Figure: data analysis workflow, from images and intensities through reads and alignments to polymorphisms. Labels: Instrument Control Software/RTA; Basecalls (C/A/G/T); Bcl files; CASAVA 1.8 / MSR / 3rd party tools; Visualize; Biological results.]

Page 5

CASAVA

• CASAVA is a Linux application designed to:

– Translate base calls (.bcl files) to compressed, demultiplexed FASTQ files

– Align reads

– Call variants (SNPs and indels)

– Assign genotypes to variants

– Count expression levels for exons, genes and splice junctions in the case of RNA-seq runs

Page 6

Demultiplexing overview

• Demultiplexing can be done by:

– CASAVA 1.8.2

– MiSeq Reporter software (for the MiSeq)

• Demultiplexing requires a run folder (with bcl files) and a sample sheet

• Demultiplexing occurs during BCL-to-FASTQ processing

• Each index read is compared to the index sequences specified in the sample sheet

• No quality values are considered in this step
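The index comparison described on this slide can be sketched in Python. This is only an illustration under stated assumptions, not CASAVA's implementation: the function names `hamming` and `assign_sample`, the `max_mismatches` parameter, and the sample names are hypothetical. Base qualities are deliberately ignored, as the slide states.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sheet_indexes, max_mismatches=0):
    """Return the sample whose index matches the read, or None.

    sheet_indexes maps sample ID -> index sequence from the sample sheet.
    Only the bases are compared; quality values play no role.
    """
    hits = [
        sample
        for sample, index in sheet_indexes.items()
        if len(index) == len(index_read)
        and hamming(index_read, index) <= max_mismatches
    ]
    # A read matching several samples is ambiguous and stays unassigned.
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet content: sample ID -> index sequence
sheet = {"sampleA": "CTTGTA", "sampleB": "GCCAAT"}
print(assign_sample("CTTGTA", sheet))
print(assign_sample("CTTGTT", sheet, max_mismatches=1))
```

Reads for which the function returns None correspond to the reads that end up among the undetermined indices.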

Page 7

How does demultiplexing occur?

• Illumina sequencing instruments generate *.bcl files as primary sequencing output.

• CASAVA contains a BCL to FASTQ converter (configureBclToFastq.pl) that combines these per-cycle *.bcl files from a run and translates them into FASTQ files.

• In addition to generating FASTQ files, CASAVA uses a user-created or IEM sample sheet to divide the run output into projects and samples, and stores these in separate directories.

• If no sample sheet is provided, all reads are placed in the Undetermined_Indices directory by lane and are not demultiplexed.

Page 8

SampleSheet.csv

Header          Description

FCID            Flow cell ID

Lane            Positive integer indicating lane number (1-8)

SampleID        ID of sample

SampleRef       The reference sequence to be used for the sample

Index           Index sequence

Description     Description of the sample

Control         Y indicates the lane is a control lane; N means sample

Recipe          Recipe used for sequencing

Operator        Name or ID of operator

SampleProject   The project the sample belongs to
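A sample sheet with these columns can be read with Python's standard `csv` module. The sketch below assumes a CASAVA 1.8 style header row (FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, SampleProject); the function name `read_sample_sheet` and the example row are made up for illustration.

```python
import csv
import io

# Column names assumed for a CASAVA 1.8 style sample sheet
COLUMNS = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
           "Description", "Control", "Recipe", "Operator", "SampleProject"]


def read_sample_sheet(handle):
    """Parse a sample sheet and return one dict per sample row."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = []
    for row in reader:
        lane = int(row["Lane"])
        if not 1 <= lane <= 8:  # lane numbers run from 1 to 8
            raise ValueError(f"lane out of range: {lane}")
        row["Lane"] = lane
        rows.append(row)
    return rows


# Made-up example content
example = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "B809UWABXX,1,sampleA,hg19,CTTGTA,test,N,R1,SS,ProjectX\n"
)
rows = read_sample_sheet(example)
print(rows[0]["SampleID"], rows[0]["Index"])
```

Checking the sheet before starting a conversion is cheap compared with rerunning a demultiplexing job after a typo in an index or lane number.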

Page 9

Input Files for configureBclToFastq.pl

• Run folder (from RTA or OLB)

– The files actually required are shown in the graphic

• SampleSheet.csv

– User created (Microsoft Excel is easiest)

– Saved in *.csv format

– Default location is the BaseCalls directory

Page 10

BCL Conversion and Demultiplexing Invocation

• Create MakeFiles

– Builds the run folder structure and generates the MakeFiles

• cd into the analysis directory

– MakeFiles are created in the analysis directory

• Execute MakeFiles

– Starts the BCL conversion and demultiplexing run

Command(s)

/path/to/CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_DIR> \
    --output-dir <Unaligned> --sample-sheet <Input DIR>/SampleSheet.csv

cd /path/to/RunFolder/Unaligned

nohup make -j <n> &

The nohup command keeps the run going even if the session is interrupted or you log out.

The -j option specifies the extent of parallelization.
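When the conversion is launched from a script, the three steps can be assembled as argument lists first and inspected before anything is run. The following is a minimal sketch that only builds and prints the commands; the function name `casava_commands` and all paths are hypothetical, and only the options shown on the slide (`--input-dir`, `--output-dir`, `--sample-sheet`, `make -j`) are used.

```python
import shlex


def casava_commands(casava_bin, basecalls_dir, output_dir, sample_sheet, jobs):
    """Return the three steps as argument lists, without running anything."""
    configure = [
        f"{casava_bin}/configureBclToFastq.pl",
        "--input-dir", basecalls_dir,
        "--output-dir", output_dir,
        "--sample-sheet", sample_sheet,
    ]
    # The MakeFiles are executed from inside the output directory
    change_dir = ["cd", output_dir]
    run_make = ["nohup", "make", "-j", str(jobs)]
    return [configure, change_dir, run_make]


# Hypothetical paths, for illustration only
steps = casava_commands(
    "/path/to/CASAVA/bin",
    "/path/to/RunFolder/Data/Intensities/BaseCalls",
    "/path/to/RunFolder/Unaligned",
    "/path/to/RunFolder/Data/Intensities/BaseCalls/SampleSheet.csv",
    jobs=8,
)
for step in steps:
    print(shlex.join(step))
```

Building argument lists rather than one long string avoids quoting problems when a path contains spaces.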

Page 11

Bcl conversion and Demultiplexing options

• Selected command line options

Page 12

Demultiplexing output fastq file

• The fastq files are located in the Unaligned/Project_<ProjectName>/Sample_<SampleName> directories

• Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

• In the case of non-multiplexed runs, <sample name> will be replaced with the lane numbers (lane1, lane2, ..., lane8) and <barcode sequence> will be replaced with "NoIndex".

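The naming scheme can be taken apart with a regular expression, which is handy when collecting files per sample or per lane. This is a sketch under assumptions: the pattern accepts barcodes made of A, C, G, T, N and "-" or the literal "NoIndex", and the function name `parse_fastq_name` and the file names are made up.

```python
import re

# <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN-]+|NoIndex)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Split a FASTQ file name into its fields, or return None if it does not match."""
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "read", "set"):
        fields[key] = int(fields[key])  # drop the zero padding
    return fields


info = parse_fastq_name("sampleA_CTTGTA_L001_R1_001.fastq.gz")
print(info)
print(parse_fastq_name("lane1_NoIndex_L001_R2_003.fastq.gz"))
```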
Page 13

Demultiplexing Output Files, FastQ File

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

Character   ASCII Value   Phred Score   Error probability

5           53            20            0.01

?           63            30            0.001

I           73            40            0.0001
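The values in this table follow from the Phred+33 encoding: the Phred score is the ASCII value of the quality character minus 33, and the error probability is 10 raised to the power of minus the score divided by 10. A short sketch, with the function names `phred_score` and `error_probability` chosen here for illustration:

```python
def phred_score(char, offset=33):
    """Phred quality encoded by one FASTQ quality character (Phred+33)."""
    return ord(char) - offset


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# The three characters from the table
for char in "5?I":
    q = phred_score(char)
    print(char, ord(char), q, error_probability(q))
```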

Page 14

Demultiplex stat file

Page 15

Demultiplexing Output Files, Summary File

• The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID directory.

Page 16

Troubleshooting indexes

• Linux command line to determine raw index sequence frequency

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

• Go to Undetermined_Indices/Sample_lane<n>

• Command line:

gunzip -c lane1_Undetermined_L001_R1_001.fastq.gz \
    | awk '{if($2~/:/) {sub(/.*:/,"",$2); print $2}}' \
    | sort -n | uniq -c | sort -n -r > index.list.txt
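The same index tally can be produced in Python, which may be easier to extend (for example, to compare the most frequent indexes against the sample sheet). This sketch assumes four lines per FASTQ record and a header whose second field ends in the index sequence, as in the example read shown on this slide; the function name `count_indexes` and the small test file are made up.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_indexes(fastq_gz_path):
    """Count index sequences found in the header lines of a gzipped FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:  # headers are every 4th line
                continue
            fields = line.split()
            if len(fields) > 1 and ":" in fields[1]:
                # Second field looks like 1:N:0:CTTGTA; keep the last part
                counts[fields[1].rsplit(":", 1)[1]] += 1
    return counts


# Small made-up file to try the function on
records = [("r1", "CTTGTA"), ("r2", "CTTGTA"), ("r3", "GCCAAT")]
path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wt") as out:
    for name, index in records:
        out.write(f"@{name} 1:N:0:{index}\nACGT\n+\nIIII\n")

index_counts = count_indexes(path)
for index, n in index_counts.most_common():
    print(n, index)
```

An unexpectedly frequent index in the undetermined reads usually points to a wrong or missing entry in the sample sheet.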

Page 17

FastQC

• FastQC provides a simple way to run quality control checks on raw sequence data.

• It gives a quick impression of whether your data has any problems you should be aware of before doing any further analysis.

• Main functions

– Import of data from BAM, SAM or FastQ files (any variant)

– A quick overview that tells you in which areas there may be problems

– Summary graphs and tables to quickly assess your data

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
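To give a feel for the kind of summary such a tool reports, the sketch below computes the mean Phred score at each read position. It is an illustration only and not FastQC's implementation; the function name `mean_quality_per_position` and the quality strings are made up, and Phred+33 encoding is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position across all reads."""
    totals, counts = [], []
    for quality in quality_strings:
        for pos, char in enumerate(quality):
            if pos == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Made-up quality strings; reads may differ in length
means = mean_quality_per_position(["II5", "I5", "??5"])
print(means)
```

A drop in the mean towards the end of the reads is the typical pattern that motivates trimming.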

Page 18

Quality check

Page 19

PRINSEQ

• PRINSEQ can be used to filter, reformat, or trim your genomic and metagenomic sequence data.

• Takes FASTQ files as input

• Sequence data can be filtered to remove sequence copies, short or long sequences, sequences with N's, low-quality sequences, and much more.

http://prinseq.sourceforge.net/

Command(s)

/path/to/prinseq-lite -fastq <input.fastq> -out_good out/fastq_filt -out_bad null \
    -trim_right 10 -ns_max_p 5 -lc_method dust -lc_threshold 10 -no_qual_header
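Two of the options in this command, `-trim_right 10` and `-ns_max_p 5`, are easy to picture in code. The sketch below is not PRINSEQ's implementation; it assumes trimming is applied before the N filter, leaves out the low-complexity (dust) filter entirely, and uses a made-up function name `trim_and_filter`.

```python
def trim_and_filter(seq, qual, trim_right=10, ns_max_p=5):
    """Trim bases from the 3' end, then reject reads with too many Ns.

    Returns (seq, qual) for a kept read, or None for a rejected one.
    """
    if trim_right:
        seq, qual = seq[:-trim_right], qual[:-trim_right]
    if not seq:  # nothing left after trimming
        return None
    n_percent = 100 * seq.upper().count("N") / len(seq)
    if n_percent > ns_max_p:
        return None
    return seq, qual


# Made-up reads
print(trim_and_filter("ACGT" * 5, "I" * 20))
print(trim_and_filter("N" * 5 + "A" * 15, "I" * 20))
```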

Page 20

Filtering

[Figure: two panels, "Before Trimming" and "After Trimming"]

Page 21

Adaptor Trimming

• Adaptor trimming is done before downstream analysis

• If the read length is shorter than the actual insert size, there is no need to do trimming.

• Adaptor masking with a .fasta file (CASAVA)
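When the read is longer than the insert, the sequencer reads into the adaptor, so the adaptor appears at the 3' end of the read. A minimal exact-match sketch of removing it is shown below. Real trimmers also tolerate sequencing errors, which this sketch does not; the function name `trim_adapter`, the `min_overlap` parameter, and the example reads are made up, and the adaptor string is only an example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove the adapter, and everything after it, from the 3' end of a read."""
    # Case 1: the full adapter occurs somewhere inside the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with only the first bases of the adapter
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


# Example adaptor sequence, used here only for illustration
adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGT" + adapter + "TTT", adapter))
print(trim_adapter("ACGTACGTAGATC", adapter))
```

The `min_overlap` threshold keeps the function from cutting reads that merely end in one or two bases that happen to match the start of the adaptor.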

Page 22

QC and filtering software

• Other software that performs one or both of these tasks:

– cutadapt

– Trim Galore

– Trimmomatic

– Sickle/Scythe

– FASTX-Toolkit

Page 23

Thank you!