De-multiplexing & Quality Control Challenges and Solutions
Sridhar Srinivasan
Bioinformatician
Premas Lifescience
Abstract
• Illumina offers multiple sequencing platforms
– that produce large amounts of high-quality sequence data
in a short time frame.
• To use the full potential of a sequencing run, we
generally multiplex many samples into a single run.
• This has to be followed by demultiplexing and
quality control of the data to get reliable and
reproducible results.
Our Discussion Here Is About
• The tools and strategies to get good demultiplexed
data.
• How to check various quality parameters of
sequencing data, and methods to remove
low-quality or contaminated reads.
Data analysis
[Figure: data analysis workflow. Instrument Control Software/RTA: Images, Intensities, Basecalls (Bcl files). CASAVA 1.8 / MSR / 3rd party tools: Reads, Alignments, Polymorphisms. Final steps: Visualize, Biological results.]
CASAVA
• CASAVA is a Linux application designed to:
– Translate base calls (.bcl files) to compressed,
demultiplexed FASTQ files
– Align reads
– Call variants (SNPs and indels)
– Assign genotypes to variants
– Count expression levels for exons, genes and splice
junctions in RNA-seq runs
Demultiplexing overview
• Demultiplexing can be done by:
– CASAVA 1.8.2
– MiSeq Reporter software (for the MiSeq)
• Demultiplexing requires a run folder (with bcl files) and a sample sheet
• Demultiplexing occurs during BCL to FASTQ processing
• Each index sequence read is compared to the index
sequence specified in the sample sheet
• No quality values are considered in this step
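The comparison step can be sketched in a few lines of Python. This is only an illustration, not CASAVA's code: the `sheet` dictionary, the sample names, and the `max_mismatches` tolerance are assumptions made for the example. Base qualities are ignored, matching the description above.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, sheet: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Return the sample whose index matches the read, or None if undetermined.

    `sheet` maps index sequence -> sample ID (hypothetical structure).
    Only the bases are compared; quality values play no role here.
    """
    hits = [sample for index, sample in sheet.items()
            if hamming(index_read, index) <= max_mismatches]
    # A read matching no index, or more than one, is left unassigned.
    return hits[0] if len(hits) == 1 else None


# Illustrative sample sheet content; the sample names are made up.
sheet = {"CTTGTA": "sampleA", "ATCACG": "sampleB"}

print(assign_sample("CTTGTA", sheet))                    # exact match
print(assign_sample("CTTGTT", sheet, max_mismatches=1))  # one mismatch tolerated
print(assign_sample("GGGGGG", sheet, max_mismatches=1))  # no index is close enough
```

Reads for which the function returns None correspond to what ends up in the Undetermined_Indices directory.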
How does demultiplexing occur?
• Illumina sequencing instruments generate *.bcl files as primary
sequencing output.
• CASAVA contains a BCL to FASTQ
converter (configureBclToFastq.pl) that combines these per-cycle
*.bcl files from a run and translates them into FASTQ files.
• In addition to generating FASTQ files, CASAVA uses a user-created
or IEM sample sheet to divide the run output into projects and
samples, and stores these in separate directories.
• If no sample sheet is provided, all samples will be put in the
Undetermined_Indices directory by lane, and not demultiplexed.
Samplesheet.csv
Column header   Description
FCID            Flow cell ID
Lane            Positive integer indicating the lane number (1-8)
SampleID        ID of the sample
SampleRef       The reference sequence to be used for the sample
Index           Index sequence
Description     Description of the sample
Control         Y indicates the lane is a control lane; N means sample
Recipe          Recipe used for sequencing
Operator        Name or ID of the operator
SampleProject   The project the sample belongs to
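A sample sheet is easy to check before starting a run. The sketch below assumes the CASAVA 1.8 column order FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, SampleProject and a single-index run. The `validate_row` helper and every value in the example sheet are hypothetical.

```python
import csv
import io

# Column order assumed from the CASAVA 1.8 sample sheet layout.
COLUMNS = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
           "Description", "Control", "Recipe", "Operator", "SampleProject"]


def validate_row(row: dict) -> list:
    """Return a list of problems found in one sample sheet row."""
    problems = []
    if not row["Lane"].isdigit() or not 1 <= int(row["Lane"]) <= 8:
        problems.append("Lane must be an integer from 1 to 8")
    if row["Control"] not in ("Y", "N"):
        problems.append("Control must be Y or N")
    # Single-index runs assumed; a dual-index sheet would need a looser check.
    if set(row["Index"]) - set("ACGT"):
        problems.append("Index must contain only A, C, G, T")
    return problems


# Hypothetical sample sheet content; every value is made up for illustration.
text = (",".join(COLUMNS) + "\n"
        "B809UWABXX,1,sampleA,hg19,CTTGTA,test,N,R1,SS,projectX\n"
        "B809UWABXX,9,sampleB,hg19,ATCACN,test,N,R1,SS,projectX\n")

for row in csv.DictReader(io.StringIO(text)):
    print(row["SampleID"], validate_row(row) or "ok")
```

The second row is deliberately wrong (lane 9, an N in the index) to show what the check reports.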
Input Files for configureBclToFastq.pl
• Run Folder (from RTA or OLB)
– Files actually required are in the graphic shown
• SampleSheet.csv
– User created (Microsoft Excel is easiest)
– Saved in *.csv format
– Default location is the BaseCalls directory
BCL Conversion and Demultiplexing Invocation
• Create MakeFiles
– Builds the run folder structure and generates the MakeFiles
• cd into the Analysis Directory
– MakeFiles are created in the analysis directory
• Execute MakeFiles
– Start the BCL conversion and Demultiplexing run
Command(s)
/path/to/CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_DIR> \
    --output-dir <Unaligned> --sample-sheet <Input DIR>/SampleSheet.csv
cd /path/to/RunFolder/Unaligned
nohup make -j <n> &
The nohup command keeps the process running even if the session is interrupted or you log out.
The -j option specifies the extent of parallelization.
Bcl conversion and Demultiplexing options
• Selected command line options
Demultiplexing output fastq file
• The fastq files are located in the
Unaligned/Project_<ProjectName>/Sample_<SampleName> directories
• Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
• In the case of non-multiplexed runs, <sample name> will
be replaced with the lane numbers (lane1, lane2, ...,
lane8) and <barcode sequence> will be replaced with
"NoIndex".
Demultiplexing Output Files, FastQ File
@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA
TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA
+
=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E
Character   ASCII value   Phred score   Error probability
5           53            20            0.01
?           63            30            0.001
I           73            40            0.0001
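The table follows from two formulas: Phred score Q = ASCII value - 33 (the Phred+33 encoding that these values imply), and error probability P = 10^(-Q/10). A minimal Python sketch, with function names chosen only for this example:

```python
def phred33_to_scores(quality: str) -> list:
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Reproduce the rows of the table for the characters 5, ? and I.
for ch in "5?I":
    q = ord(ch) - 33
    print(ch, ord(ch), q, error_probability(q))
```

Applied to a whole quality line, `phred33_to_scores` gives one score per base of the read.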
Demultiplex stat file
Demultiplexing Output Files, Summary File
• The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID
directory.
Troubleshooting indexes
• Linux command line to determine raw index sequence frequency
@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA
TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA
+
=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E
• Go to Undetermined_Indices/Sample_lane<n>
• Command line:
gunzip -c lane1_Undetermined_L001_R1_001.fastq.gz \
| awk '{if($2~/:/) {sub(/.*:/,"",$2); print $2}}' \
| sort -n | uniq -c | sort -n -r > index.list.txt
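The same count can be sketched in Python for cases where the result feeds into further scripting. This assumes the header layout shown above, with the index sequence after the last colon, and four lines per FASTQ record. The function names and the in-memory demo records are made up for the example.

```python
import gzip
from collections import Counter


def count_indexes(lines) -> Counter:
    """Count index sequences from the header line of each 4-line FASTQ record."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:  # header line of each record
            # The index is the text after the last ':' of the header.
            counts[line.rstrip("\n").rsplit(":", 1)[-1]] += 1
    return counts


def count_indexes_in_file(path: str) -> Counter:
    """Same count for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return count_indexes(handle)


# Tiny in-memory demo; the read names and sequences are invented.
demo = [
    "@r1 1:N:0:CTTGTA\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 1:N:0:ATCACG\n", "ACGT\n", "+\n", "IIII\n",
    "@r3 1:N:0:CTTGTA\n", "ACGT\n", "+\n", "IIII\n",
]
for index, n in count_indexes(demo).most_common():
    print(n, index)
```

For a real file, `count_indexes_in_file("lane1_Undetermined_L001_R1_001.fastq.gz").most_common(20)` would list the most frequent unassigned indexes first.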
FastQC
• FastQC provides a simple way to do some quality control
checks on raw sequence data.
• It gives a quick impression of whether your data has any
problems of which you should be aware before doing any
further analysis.
• Main Functions
-- Import of data from BAM, SAM or FastQ files (any variant)
-- A quick overview showing in which areas there may be
problems
-- Summary graphs and tables to quickly assess your data
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
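To give a feel for the kind of summary behind such graphs, the sketch below computes the mean Phred score at each read position. It is not FastQC's implementation, only a simplified illustration that assumes Phred+33 quality strings. The function name is hypothetical.

```python
def mean_quality_per_position(qualities, offset=33):
    """Mean Phred score at each read position over a set of quality strings.

    Reads may differ in length; each position is averaged over the reads
    that are long enough to cover it.
    """
    totals, counts = [], []
    for quality in qualities:
        for i, ch in enumerate(quality):
            if i == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two short made-up quality strings: 'I' is Q40 and '5' is Q20.
print(mean_quality_per_position(["II5", "I5"]))
```

A drop in these means toward the end of the read is the typical pattern that motivates trimming.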
Quality check
PRINSEQ
• PRINSEQ can be used to filter, reformat, or trim your genomic and
metagenomic sequence data.
• Fastq files as input
• Sequence data can be filtered to remove sequence copies, short or
long sequences, sequences with N's, low-quality sequences, and
much more.
http://prinseq.sourceforge.net/
Command(s)
/path/to/prinseq-lite -fastq <input.fastq> -out_good out/fastq_filt -out_bad null \
    -trim_right 10 -ns_max_p 5 -lc_method dust -lc_threshold 10 -no_qual_header
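Two of the options in the command above are easy to illustrate: -trim_right 10 removes 10 bases from the 3' end, and -ns_max_p 5 discards reads with more than 5 percent Ns. The sketch below is not PRINSEQ's code. The function names are hypothetical, the order of operations (trim first, then filter) is an assumption, and the DUST low-complexity filter is left out.

```python
def trim_right(seq: str, qual: str, n: int):
    """Remove n bases (and their qualities) from the 3' end of a read."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def percent_n(seq: str) -> float:
    """Percentage of N characters in a sequence."""
    return 100.0 * seq.upper().count("N") / len(seq) if seq else 0.0


def keep_read(seq: str, qual: str, trim: int = 10, ns_max_p: float = 5.0):
    """Return the trimmed (seq, qual) pair, or None if the read is rejected."""
    seq, qual = trim_right(seq, qual, trim)
    if not seq or percent_n(seq) > ns_max_p:
        return None
    return seq, qual


# Made-up reads: the first loses its N tail to trimming and is kept,
# the second still has too many Ns after trimming and is rejected.
print(keep_read("ACGT" * 5 + "N" * 10, "I" * 30))
print(keep_read("N" * 5 + "A" * 25, "I" * 30))
```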
Filtering
[Figure: two panels, "Before Trimming" and "After Trimming"]
Adaptor Trimming
• Adaptor trimming is done before downstream analysis
• If the read length is shorter than the actual insert size, there
is no need to do trimming.
• --adaptor--masking .fasta file (CASAVA)
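Adaptor sequence only shows up at the 3' end of a read when the read runs past the end of the insert, which is why short reads on long inserts need no trimming. A minimal sketch of 3' adaptor removal follows. It uses exact matching only, whereas real trimming tools tolerate sequencing errors. The function name, the `min_overlap` parameter, and the adaptor prefix used in the example are assumptions.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adaptor from a read using exact matching.

    Handles a full adaptor anywhere in the read, or a partial adaptor
    (a prefix of it) at the very end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full adaptor found: keep what precedes it
    lowest = max(min_overlap, 1)
    # Try the longest possible partial match first.
    for overlap in range(min(len(adapter) - 1, len(read)), lowest - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adaptor detected


ADAPTER = "AGATCGGAAGAGC"  # assumed adaptor prefix, for illustration only

print(trim_adapter("ACGTACGT" + ADAPTER + "TTT", ADAPTER))  # full adaptor
print(trim_adapter("ACGTACGTAGATC", ADAPTER))               # partial adaptor at the end
print(trim_adapter("ACGTACGT", ADAPTER))                    # nothing to trim
```

A larger `min_overlap` reduces the chance of removing genuine bases that happen to look like the start of the adaptor.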
QC and filtering software
• Other software that performs either or both of these
tasks:
-- cutadapt
-- Trim Galore
-- Trimmomatic
-- Sickle/Scythe
-- FASTX-Toolkit
Thank you!