
De-multiplexing & Quality Control Challenges and Solutions

Sridhar Srinivasan

Bioinformatician

Premas Lifescience

Abstract

• Illumina offers multiple sequencing platforms that produce large amounts of high-quality sequence data in a short time frame.

• To use the full capacity of a sequencing run, many samples are generally multiplexed into a single run.

• This has to be followed by demultiplexing and

quality control of data to get reliable and

reproducible results.

Our Discussion Here Is About

• The tools and strategies to get good demultiplexed

data.

• How to check various quality parameters of sequencing data, and methods to remove low-quality or contaminated reads.

Data analysis

[Figure: data analysis workflow. Images → Intensities → Basecalls (Bcl files; Instrument Control Software/RTA) → Reads → Alignments → Polymorphisms (CASAVA 1.8 / MSR / 3rd party tools) → Visualize → Biological results]

CASAVA

• CASAVA is a Linux application designed to:

– Translate base calls (.bcl files) to compressed,

demultiplexed FASTQ files

– Align reads

– Call variants (SNPs and indels)

– Assign genotypes to variants

– Count expression levels for exons, genes and splice junctions in RNA-seq runs

Demultiplexing overview

• Demultiplexing can be done by:

– CASAVA 1.8.2

– MiSeq Reporter software (for the MiSeq)

• Demultiplexing requires a run folder (with bcl files) and a sample sheet

• Demultiplexing occurs during Bcl to FASTQ processing

• Each index sequence read is compared to the index

sequence specified in the sample sheet

• No quality values are considered in this step
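The comparison step can be sketched in Python as follows. This is an illustration only, not CASAVA's implementation: the sample names, the second index sequence and the `max_mismatches` parameter are assumptions made for the example.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, sheet: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Return the sample whose index matches the index read, or None.

    Only the bases are compared; quality values play no role.
    A read that matches more than one sample is left unassigned.
    """
    hits = [sample for sample, index in sheet.items()
            if len(index) == len(index_read)
            and hamming(index_read, index) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet content: sample name -> index sequence
sheet = {"sampleA": "CTTGTA", "sampleB": "GCCAAT"}

print(assign_sample("CTTGTA", sheet))                    # exact match
print(assign_sample("CTTGTT", sheet))                    # one mismatch, none allowed
print(assign_sample("CTTGTT", sheet, max_mismatches=1))  # one mismatch, one allowed
```

Reads for which the function returns None correspond to the reads that end up in the Undetermined_Indices directory.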

How Does Demultiplexing Occur?

• Illumina sequencing instruments generate *.bcl files as primary

sequencing output.

• CASAVA contains a BCL to FASTQ converter (configureBclToFastq.pl) that combines the per-cycle *.bcl files from a run and translates them into FASTQ files.

• In addition to generating FASTQ files, CASAVA uses a user-created or IEM (Illumina Experiment Manager) sample sheet to divide the run output into projects and samples, and stores these in separate directories.

• If no sample sheet is provided, the reads are not demultiplexed and are placed in the Undetermined_Indices directory, by lane.

SampleSheet.csv

Header          Description
FCID            Flow cell ID
Lane            Positive integer indicating lane number (1-8)
SampleID        ID of sample
SampleRef       The reference sequence to be used for the sample
Index           Index sequence
Description     Description of the sample
Control         Y indicates the lane is a control lane; N indicates a sample
Recipe          Recipe used for sequencing
Operator        Name or ID of operator
SampleProject   The project the sample belongs to
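As an illustration of the layout, the sketch below writes a two-sample sheet with these columns and reads it back, checking that no index is used twice in the same lane. Every value in the rows (sample names, reference, recipe, operator, project) is made up for the example, and the duplicate-index check is an added convenience, not something the slides describe.

```python
import csv
import io

COLUMNS = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
           "Description", "Control", "Recipe", "Operator", "SampleProject"]

# Hypothetical rows; all values are placeholders for illustration.
rows = [
    ["B809UWABXX", "1", "sampleA", "hg19", "CTTGTA", "test sample", "N", "R1", "SS", "ProjectX"],
    ["B809UWABXX", "1", "sampleB", "hg19", "GCCAAT", "test sample", "N", "R1", "SS", "ProjectX"],
]

# Write to an in-memory buffer; for a real file use
# open("SampleSheet.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(rows)
sheet_text = buffer.getvalue()
print(sheet_text)

# Read the sheet back and make sure each (lane, index) pair is unique.
seen = {}
for record in csv.DictReader(io.StringIO(sheet_text)):
    key = (record["Lane"], record["Index"])
    if key in seen:
        raise ValueError(f"index {key[1]} used twice in lane {key[0]}")
    seen[key] = record["SampleID"]
print(seen)
```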

Input Files for configureBclToFastq.pl

• Run Folder (from RTA or

OLB)

– Files actually required

are in the graphic shown

• SampleSheet.csv

– User created (Microsoft

Excel is easiest)

– Saved as *.csv format

– Default location is the BaseCalls directory

BCL Conversion and Demultiplexing Invocation

• Create MakeFiles

– Builds the run folder structure and generates the MakeFiles

• cd into the Analysis Directory

– MakeFiles are created in the analysis directory

• Execute MakeFiles

– Start the BCL conversion and Demultiplexing run

Command(s)

/path/to/CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_DIR> \
    --output-dir <Unaligned> --sample-sheet <Input DIR>/SampleSheet.csv

cd /path/to/RunFolder/Unaligned

nohup make -j <n> &

The nohup command keeps the process running even if the session is interrupted or you log out.

The -j option specifies the extent of parallelization (the number of jobs make runs at once).

Bcl Conversion and Demultiplexing Options

• Selected command line options [table on the slide; not preserved in this text]

Demultiplexing output fastq file

• The FASTQ files are located in the Unaligned/Project_<ProjectName>/Sample_<SampleName> directories

• Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

• In the case of non-multiplexed runs, <sample name> will

be replaced with the lane numbers (lane1, lane2, ...,

lane8) and <barcode sequence> will be replaced with

"NoIndex".

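The naming scheme can be taken apart with a regular expression. The sketch below is an assumption-laden illustration: it accepts a plain base sequence, "NoIndex" or "Undetermined" in the barcode field and a single-digit read number, and the file names used are examples rather than output from a real run.

```python
import re

# <sample>_<barcode>_L<lane, 3 digits>_R<read>_<set, 3 digits>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+|NoIndex|Undetermined)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(name: str) -> dict:
    """Split a FASTQ file name into its fields; numeric fields become ints."""
    match = FASTQ_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised FASTQ file name: {name}")
    fields = match.groupdict()
    for key in ("lane", "read", "set"):
        fields[key] = int(fields[key])
    return fields


print(parse_fastq_name("sampleA_CTTGTA_L001_R1_001.fastq.gz"))
print(parse_fastq_name("lane1_NoIndex_L001_R2_003.fastq.gz"))
```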
Demultiplexing Output Files, FastQ File

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

Character   ASCII value   Phred score   Error probability
5           53            20            0.01
?           63            30            0.001
I           73            40            0.0001
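The values in the table follow from two relations: the Phred score is the ASCII value minus an offset, and the error probability is 10^(-Q/10). The offset of 33 used below is inferred from the table itself (53 maps to 20); the function names are chosen for this sketch.

```python
def phred_score(character: str, offset: int = 33) -> int:
    """Convert one quality character to its Phred score."""
    return ord(character) - offset


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Reproduce the three rows of the table.
for character in "5?I":
    score = phred_score(character)
    print(character, ord(character), score, f"{error_probability(score):g}")

# Scores for the first few characters of the example quality string.
print([phred_score(c) for c in "=55>7;?::B"])
```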

Demultiplexing Output Files, Summary File

• The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_<FCID> directory.

[Figure: Demultiplex stat file]

Troubleshooting indexes

• Linux command line to determine raw index sequence frequency

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

• Go to Undetermined_Indices/Sample_lane<n>

• Command line:

gunzip -c lane1_Undetermined_L001_R1_001.fastq.gz \
| awk '{if($2~/:/) {sub(/.*:/,"",$2); print $2}}' \
| sort -n | uniq -c | sort -n -r > index.list.txt
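The same tally can be sketched in Python. Unlike the awk filter, this version only looks at the first line of each four-line record. The second and third records in the demo input are invented so that the example runs on its own; with real data, pass the path of the Undetermined FASTQ file to `count_indexes`.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_indexes(path: str) -> Counter:
    """Count the index sequence at the end of each FASTQ header line."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:   # only the header line of each record
                continue
            fields = line.split()
            if len(fields) > 1 and ":" in fields[1]:
                # "1:N:0:CTTGTA" -> "CTTGTA"
                counts[fields[1].rsplit(":", 1)[1]] += 1
    return counts


# Tiny demo input (hypothetical reads apart from the first header).
records = [
    ("@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA", "ACGT", "IIII"),
    ("@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13502:2241 1:N:0:CTTGTA", "ACGT", "IIII"),
    ("@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13503:2242 1:N:0:GCCAAT", "ACGT", "IIII"),
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lane1_Undetermined_L001_R1_001.fastq.gz")
    with gzip.open(path, "wt") as out:
        for header, sequence, quality in records:
            out.write(f"{header}\n{sequence}\n+\n{quality}\n")
    index_counts = count_indexes(path)

# Most frequent index first, as in index.list.txt
for index, n in index_counts.most_common():
    print(n, index)
```

An index that appears very often among the undetermined reads but is absent from the sample sheet usually points to a typo or a missing entry in the sheet.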

FastQC

• FastQC provides a simple way to run quality control checks on raw sequence data.

• It gives a quick impression of whether your data has any problems you should be aware of before doing any further analysis.

• Main functions

-- Import of data from BAM, SAM or FastQ files (any variant)

-- A quick overview showing in which areas there may be problems

-- Summary graphs and tables to quickly assess your data

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
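To give a feel for the kind of summary such a check produces, the sketch below computes the mean Phred score at each read position, which is the quantity behind a per-base quality plot. It is not FastQC's code; the quality strings are made up and a Phred offset of 33 is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each position across reads of possibly unequal length."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, character in enumerate(quality):
            if position == len(totals):   # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(character) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings; quality drops towards the 3' end.
qualities = ["IIII5", "II?5", "I?55"]
for position, mean in enumerate(mean_quality_per_position(qualities), start=1):
    print(position, round(mean, 2))
```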

Quality check

PRINSEQ

• PRINSEQ can be used to filter, reformat, or trim your genomic and

metagenomic sequence data.

• Takes FASTQ files as input

• Sequence data can be filtered to remove sequence copies, short or

long sequences, sequences with N's, low-quality sequences, and

much more.

http://prinseq.sourceforge.net/

Command(s)

/path/to/prinseq-lite -fastq <input.fastq> -out_good out/fastq_filt -out_bad null -trim_right 10 -ns_max_p 5 -lc_method dust -lc_threshold 10 -no_qual_header
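Two of the options in this command are easy to picture in code: -trim_right 10 removes ten positions from the 3' end of every read, and -ns_max_p 5 discards reads with more than 5% N bases. The sketch below mimics only these two steps on invented reads, applying the trim before the filter as an assumption about the processing order; the DUST low-complexity filter is not reproduced.

```python
def trim_right(sequence: str, quality: str, n: int):
    """Remove n positions from the 3' end of a read and its quality string."""
    if n <= 0:
        return sequence, quality
    return sequence[:-n], quality[:-n]


def percent_n(sequence: str) -> float:
    """Percentage of N bases in a sequence (0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def filter_reads(reads, trim=10, max_n_percent=5.0):
    """Trim each read, then keep it only if it is non-empty and has few Ns."""
    kept = []
    for name, sequence, quality in reads:
        sequence, quality = trim_right(sequence, quality, trim)
        if sequence and percent_n(sequence) <= max_n_percent:
            kept.append((name, sequence, quality))
    return kept


# Hypothetical reads of length 30.
reads = [
    ("r1", "ACGT" * 5 + "N" * 10, "I" * 30),            # Ns only in the trimmed tail
    ("r2", "NNNN" + "ACGT" * 4 + "A" * 10, "I" * 30),   # 20% N after trimming
]
for name, sequence, quality in filter_reads(reads):
    print(name, sequence, len(sequence))
```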


Filtering

[Figure: before trimming vs. after trimming]

Adaptor Trimming

• Adaptor trimming is done before downstream analysis

• If the read length is shorter than the actual insert size, the read never reaches the adaptor, so there is no need to do trimming.

• Adaptor masking using a .fasta file of adaptor sequences (CASAVA)
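When the insert is shorter than the read, the read runs into the adaptor, and everything from the adaptor onwards has to be removed. The sketch below shows the idea with exact matching only; the adaptor sequence is the start of the common Illumina TruSeq adaptor, used here purely for illustration, and the `min_overlap` parameter is an assumption. The dedicated tools listed on the next slide also tolerate mismatches and use quality information.

```python
def trim_adaptor(read: str, adaptor: str, min_overlap: int = 3) -> str:
    """Remove an adaptor from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adaptor occurs inside the read.
    position = read.find(adaptor)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with the first few bases of the adaptor.
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    # No adaptor found: the insert is at least as long as the read.
    return read


ADAPTOR = "AGATCGGAAGAGC"   # illustrative adaptor sequence
insert = "ACGTACGTAC"       # hypothetical insert

print(trim_adaptor(insert + ADAPTOR + "TTT", ADAPTOR))  # full adaptor present
print(trim_adaptor(insert + "AGATC", ADAPTOR))          # partial adaptor at the end
print(trim_adaptor(insert, ADAPTOR))                    # nothing to trim
```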


QC and Filtering Software

• Other software performing either or both of these:

-- cutadapt

-- Trim Galore

-- Trimmomatic

-- Sickle/Scythe

-- FASTX-Toolkit


Thank you!
