
OUTLINE - DTU

Jan 16, 2022


dariahiddleston
Page 1: OUTLINE - DTU
Page 2: OUTLINE - DTU

OUTLINE

• The main steps in NGS analysis

• Why is preprocessing important?

• Preprocessing

• FastQC reports

• Adapters

• K-mers

• Depth of coverage vs Breadth of coverage

• Merge paired end reads

• Ion Torrent data

• Exercises

Page 3: OUTLINE - DTU

MAIN STEPS IN NGS ANALYSIS

[Figure: workflow from Question to Answer through the stages Raw reads, Pre-process, Assembly (alignment / de novo), Analysis, and Compare samples / methods; side label: DATA SIZE]

Page 4: OUTLINE - DTU

WHY IS PREPROCESSING IMPORTANT?

Quality?

Every base in a read has a quality score. Note: bases are not always correct!

Errors?

Different sequencing technologies have different error profiles.

Sequencing depth?

How deeply the sample is sequenced: how many times your data covers the genome.

Adapters?

Adapters/primers are non-biological sequences that can be part of the raw data.

Do we trust our data?

Page 5: OUTLINE - DTU

FASTQC REPORTS

• Report basic statistics on your data

• Identify issues with your data

Page 6: OUTLINE - DTU

PER BASE SEQUENCE QUALITY

Quality often decreases over the read.

Page 7: OUTLINE - DTU

AVERAGE QUALITY

Remove reads with an average quality below 20.

Remove reads with ‘N’ base calls.
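The two filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes Phred+33 quality encoding (common in current Illumina FASTQ files), and the function names `phred_scores` and `keep_read` as well as the example reads are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def keep_read(sequence, quality_string, min_mean_quality=20, offset=33):
    """Keep a read only if it has no 'N' calls and its mean quality is >= the threshold."""
    if "N" in sequence.upper():
        return False
    scores = phred_scores(quality_string, offset)
    if not scores:
        return False  # an empty read carries no information
    return sum(scores) / len(scores) >= min_mean_quality


# Hypothetical reads as (sequence, quality string) pairs.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # 'I' decodes to Q40 under Phred+33
    ("ACGTACGT", "########"),  # '#' decodes to Q2 under Phred+33
    ("ACGNACGT", "IIIIIIII"),  # contains an 'N' base call
]
kept = [seq for seq, qual in reads if keep_read(seq, qual)]
print(kept)
```

Only the first read passes: the second fails the average-quality rule and the third contains an 'N'.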

Page 8: OUTLINE - DTU

TRIM FROM 5’

Sometimes something is fishy at the beginning of the read.

It is recommended to remove the first few bases from the 5’ end.

How many bases would you remove in this case?

Page 9: OUTLINE - DTU

ADAPTERS

• Sometimes adapters / primers are also part of the read

• Adapters / primers are non-biological sequences

• The artificial repeats will disturb alignments and de novo assembly

• The sequence is often known; if not, FastQC may find it
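To make the idea of adapter removal concrete, here is a deliberately naive sketch that cuts a known 3’ adapter from a read using exact matching only. Dedicated trimmers also tolerate mismatches and use quality scores, which this sketch does not attempt. The function name `trim_adapter_3prime`, the `min_overlap` default and the example sequences are illustrative assumptions.

```python
def trim_adapter_3prime(read, adapter, min_overlap=3):
    """Remove a 3' adapter by exact matching (naive sketch, no mismatches allowed)."""
    # Case 1: the whole adapter occurs in the read; cut from its first position.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends inside the adapter, so only an adapter prefix is present.
    longest = min(len(adapter), len(read)) - 1
    for length in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read  # no adapter found


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used here for illustration only
print(trim_adapter_3prime("ACGTTGCA" + adapter + "TT", adapter))  # full adapter inside read
print(trim_adapter_3prime("ACGTTGCAAGATC", adapter))              # partial adapter at the end
print(trim_adapter_3prime("ACGTTGCA", adapter))                   # nothing to trim
```

All three calls return `ACGTTGCA`. The `min_overlap` parameter guards against trimming very short chance matches at the end of a read.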

Page 10: OUTLINE - DTU

ADAPTERS

We will use “Cutadapt” and “AdapterRemoval”, but other programs can also do the job.

Page 11: OUTLINE - DTU

K-MER CORRECTION

• Create a sliding window of size k, move it over all your reads and count the occurrences of each k-mer

• We can use this to correct sequencing errors!

k = 4, DNA: ACGTGTAACGTGACGTTGGA

First k-mers: ACGT, CGTG, GTGT, TGTA, ...
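The sliding-window counting described on this slide can be sketched as follows, using the slide's own DNA string and k = 4. The function name `count_kmers` is a hypothetical helper, not part of any specific tool.

```python
from collections import Counter


def count_kmers(reads, k):
    """Slide a window of size k over every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


dna = "ACGTGTAACGTGACGTTGGA"
counts = count_kmers([dna], k=4)

first_kmers = [dna[i:i + 4] for i in range(4)]
print(first_kmers)          # the first four windows, as drawn on the slide
print(counts["ACGT"])       # how often ACGT occurs in this sequence
print(sum(counts.values())) # total number of windows: len(dna) - k + 1
```

In this sequence ACGT occurs three times and there are 17 windows in total.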

Page 12: OUTLINE - DTU

K-MER CORRECTION

[Figure: stacked reads covering the same region; one read is ACGTGGTTGCCCTTAAA, all the others are ACGTGGTTACCCTTAAA]

Concept: rare k-mers are sequencing errors.

In general we need a sequencing depth > 15x.

[Figure from Kelley et al., 2010; axis label: sequencing depth]
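A minimal sketch of the "rare k-mers are errors" concept: count all k-mers across the reads and flag those seen fewer than a chosen number of times. The read set mirrors the two sequences on the slide, but the number of copies (one erroneous read, seven correct ones), the function name `rare_kmers` and the `min_count` threshold are assumptions made for illustration. Real error-correction tools go further and also decide how to correct the flagged positions.

```python
from collections import Counter


def rare_kmers(reads, k, min_count=2):
    """Return the set of k-mers that occur fewer than min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n < min_count}


# One read with a single-base error (A -> G) among several correct reads.
reads = ["ACGTGGTTGCCCTTAAA"] + ["ACGTGGTTACCCTTAAA"] * 7
suspect = rare_kmers(reads, k=4)
print(sorted(suspect))
```

The flagged k-mers are exactly the four windows that overlap the erroneous base. With low sequencing depth, correct k-mers would also be rare, which is why sufficient depth is needed for this approach.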

Page 13: OUTLINE - DTU

SEQUENCING DEPTH

[Figure: reads stacked along the reference genome; vertical axis: depth of coverage, 1x to 5x]

How many times your data covers the genome (on average).

Page 14: OUTLINE - DTU

SEQUENCING DEPTH

C = N × L / G

N: Number of reads

L: Read length

G: Genome size

C: Sequencing depth
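With these four quantities, average depth follows the standard relation C = N × L / G. The sketch below computes it and also rearranges it to estimate how many reads a target depth requires. The function names and the example numbers (a 5 Mb genome, 100 nt reads) are hypothetical.

```python
import math


def sequencing_depth(num_reads, read_length, genome_size):
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Rearranged relation: N = C * G / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: 1 million reads of 100 nt on a 5 Mb genome.
depth = sequencing_depth(num_reads=1_000_000, read_length=100, genome_size=5_000_000)
print(depth)

# Reads required to reach 15x on the same genome.
print(reads_needed(target_depth=15, read_length=100, genome_size=5_000_000))
```

Note that this is an average over the whole genome; individual regions can be covered much more or much less deeply.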

Page 15: OUTLINE - DTU

GENOME COVERAGE

Reference genome

Breadth of coverage

How much of the reference genome is covered by your data

Page 16: OUTLINE - DTU

GENOME COVERAGE

Reference genome

Breadth of coverage

Uncovered part of the genome
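To contrast breadth with depth, the following sketch computes both from a list of aligned read intervals. It assumes 0-based start positions with exclusive ends, and the function name `coverage_stats` and the toy intervals are hypothetical. A per-base array like this is only practical for small genomes.

```python
def coverage_stats(genome_size, alignments):
    """Return (breadth, mean depth) from (start, end) intervals, 0-based, end exclusive."""
    depth = [0] * genome_size
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_size)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    breadth = covered / genome_size          # fraction of positions covered at least once
    mean_depth = sum(depth) / genome_size    # average number of reads per position
    return breadth, mean_depth


# Toy genome of 20 positions with two overlapping 10 nt reads.
breadth, mean_depth = coverage_stats(20, [(0, 10), (5, 15)])
print(breadth, mean_depth)
```

Here the mean depth is 1x, yet a quarter of the genome is uncovered: a high average depth does not guarantee complete breadth.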

Page 17: OUTLINE - DTU

MERGE PAIRED END READS

• Merge overlapping pairs into single longer read

• Smart because Illumina reads have low quality at the 3’ end

• Very useful for de novo assembly

Insert size: 500 nt, reads: 100 nt, middle: 300 nt

Insert size: 180 nt, reads: 100 nt, middle: -20 nt (overlap)
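The "middle" values above are the insert size minus twice the read length; a negative value means the two reads overlap and can be merged. The sketch below shows that arithmetic plus a naive merge based on the longest exact overlap. Real merging tools allow mismatches and use quality scores to resolve them; the function names, the `min_overlap` default and the toy fragment are assumptions.

```python
def inner_distance(insert_size, read_length):
    """Unsequenced middle part of the fragment; negative means the reads overlap."""
    return insert_size - 2 * read_length


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair on the longest exact overlap; return None if none is found."""
    rc = reverse_complement(reverse)  # bring the reverse read onto the forward strand
    longest = min(len(forward), len(rc))
    for length in range(longest, max(min_overlap, 1) - 1, -1):
        if forward[-length:] == rc[:length]:
            return forward + rc[length:]
    return None


print(inner_distance(500, 100))  # gap between the reads
print(inner_distance(180, 100))  # negative: the reads overlap

# Toy 24 nt fragment sequenced with 16 nt reads from both ends (8 nt overlap).
fragment = "ACGTTGCAAGGCTTAACCGGTATC"
forward = fragment[:16]
reverse = reverse_complement(fragment[8:])
print(merge_pair(forward, reverse, min_overlap=5))
```

The merged read reproduces the full fragment, which is longer than either input read.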

Page 18: OUTLINE - DTU

454 / ION TORRENT DATA

• Main problem is indels at homopolymer runs

• (Trim homopolymers), trim trailing poor quality bases

• Remove very short reads

• For de novo assembly, adapters should be removed (prinseq)

• For alignment we use Smith-Waterman (local), so adapter removal is less important
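Since homopolymer runs are where indel errors concentrate in this kind of data, it can help to locate them in a read. This is a small sketch using `itertools.groupby`; the function name `homopolymer_runs`, the `min_length` default and the example sequence are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=3):
    """Return (start, base, length) for every run of identical bases >= min_length."""
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


print(homopolymer_runs("ACGGGGTTAAAAAC"))
```

The example reports a run of four G at position 2 and a run of five A at position 8 (0-based); the TT pair is below the length threshold and is ignored.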

Page 19: OUTLINE - DTU

Quality control for other technologies

• We heard about other newer technologies yesterday

• PacBio, Nanopore etc.

• How can we do quality control on reads from these technologies?

• Long reads quality control

Page 20: OUTLINE - DTU

FINAL – BUT IMPORTANT NOTE

• Lots of data - storage is expensive!

• Keep data compressed whenever possible (gzip, bzip, bam)

• Remove intermediate files and files that can easily be re-created
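Keeping files compressed does not have to complicate analysis: many tools read gzip-compressed FASTQ directly, and Python's standard `gzip` module can read and write such files as text. The sketch below assumes simple four-line FASTQ records; the helper names, file name and example record are hypothetical.

```python
import gzip
import os
import tempfile


def write_fastq_gz(path, records):
    """Write (name, sequence, quality) tuples as gzip-compressed FASTQ."""
    with gzip.open(path, "wt") as handle:
        for name, seq, qual in records:
            handle.write(f"@{name}\n{seq}\n+\n{qual}\n")


def read_fastq_gz(path):
    """Yield (name, sequence, quality) from a gzip-compressed four-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual


records = [("read1", "ACGT", "IIII")]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")
    write_fastq_gz(path, records)
    round_trip = list(read_fastq_gz(path))
print(round_trip)
```

The records read back are identical to those written, so nothing is lost by never storing the uncompressed file.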