Quality Control of NGS Data

May 10, 2015

Surya Saha

BTI PGRP Summer Internship Program 2014

Transcript
Page 1: Quality Control of NGS Data

Surya Saha

BTI PGRP Summer Internship Program 2014

Slides: https://bitly.com/BioinfoInternEx2014

Quality Control of NGS Data

Page 2: Quality Control of NGS Data

1. Evaluation

2. Preprocessing

Quality Control of NGS Data

7/8/2014 BTI PGRP Summer Internship Program 2014 2 Slide credit: Aureliano Bombarely

Page 3: Quality Control of NGS Data

Goal:

Learn to use read evaluation programs, paying attention to relevant parameters such as quality score distribution, read length distribution and read duplication.

Data: (Illumina data for two tomato ripening stages)

/home/bioinfo/Data/ch4_demo_dataset.tar.gz

Tools:

tar -zxvf (command line, untar and unzip the files)

head (command line, take a quick look at the files)

mv (command line, rename the files)

grep (command line, find/count patterns in files)

FASTX toolkit (command line, process fasta/fastq files)

FastQC (GUI, calculates several statistics for each file)

Evaluation

Slide credit: Aureliano Bombarely

Page 4: Quality Control of NGS Data

Exercise 1:

1. Untar and unzip the file:

/home/bioinfo/Data/ch4_demo_dataset.tar.gz

2. The raw data will be found in two directories: breaker and immature_fruit. Print the first 10 lines of the files SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq.

Question 1.1: Are these files in fastq format?

3. Change the extension of the .fq files to .fastq
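
Steps 1 to 3 can be sketched on the command line as follows. This is a minimal sketch, not the official solution: it assumes the archive unpacks the breaker and immature_fruit directories into the current working directory, and the helper name rename_fq_to_fastq is hypothetical. For Question 1.1, a fastq record normally spans four lines: a header starting with @, the sequence, a separator line starting with +, and the quality string.

```shell
# Sketch only: the archive path comes from the slides; the extraction
# location (current directory) is an assumption.
archive=/home/bioinfo/Data/ch4_demo_dataset.tar.gz

# Hypothetical helper: rename every .fq file in the given directories to .fastq.
rename_fq_to_fastq() {
    for dir in "$@"; do
        for f in "$dir"/*.fq; do
            [ -e "$f" ] || continue        # skip when the glob matches nothing
            mv "$f" "${f%.fq}.fastq"       # strip .fq, append .fastq
        done
    done
}

if [ -f "$archive" ]; then
    tar -zxvf "$archive"                          # step 1: untar and unzip
    head -n 10 breaker/*.fq immature_fruit/*.fq   # step 2: first 10 lines of each file
    rename_fq_to_fastq breaker immature_fruit     # step 3: .fq -> .fastq
fi
```

With several files on its command line, head prints a small banner with the file name before each excerpt, which makes it easy to tell the four files apart.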

Evaluation

Slide credit: Aureliano Bombarely

Page 5: Quality Control of NGS Data

Exercise 1:

4. Count the number of sequences in each fastq file using commands you learnt earlier.

5. Convert the fastq files to fasta.

6. Explore other tools in the FASTX toolkit.

7. Now count the number of sequences in each fasta file and check whether the number has changed.

Evaluation

Tip: Use ‘grep’

Tip: Use ‘fastq_to_fasta -h’ to see help

Tip: Use Google if you are stuck
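
Steps 4, 5 and 7 might look like the sketch below. It assumes the files were already renamed to .fastq, that each record occupies exactly four lines, and that FASTX is installed; the helper names count_fastq and count_fasta are hypothetical. Counting lines and dividing by four is used for fastq because a quality string can itself start with @, so a plain grep for ‘^@’ may overcount.

```shell
# Hypothetical helper: count fastq records as (number of lines) / 4.
# Assumes unwrapped records of exactly four lines each.
count_fastq() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: count fasta records by counting header lines.
count_fasta() {
    grep -c '^>' "$1"
}

# Convert each fastq file and report both counts (runs only if FASTX is installed).
for fq in breaker/*.fastq immature_fruit/*.fastq; do
    [ -e "$fq" ] || continue                           # glob matched nothing
    command -v fastq_to_fasta >/dev/null 2>&1 || break
    fa="${fq%.fastq}.fasta"
    # If FASTX rejects the quality values, the quality offset may need to be set
    # (commonly done with -Q33; treat that as an assumption and check your version).
    fastq_to_fasta -i "$fq" -o "$fa"
    echo "$fq: $(count_fastq "$fq") reads in fastq, $(count_fasta "$fa") in fasta"
done
```

If the two counts differ, the help text from ‘fastq_to_fasta -h’ is the first place to look: it lists an option that controls whether reads containing unknown (N) nucleotides are kept.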

Slide credit: Aureliano Bombarely

Page 6: Quality Control of NGS Data

Evaluation: Sequence Quality

Good Illumina dataset

Page 7: Quality Control of NGS Data

Evaluation: Sequence Quality

Good Illumina dataset

Poor Illumina dataset

Page 8: Quality Control of NGS Data

Evaluation: Sequence Quality

454

Pacific Biosciences

Page 9: Quality Control of NGS Data

Evaluation: Sequence Content

Good Illumina dataset

Page 10: Quality Control of NGS Data

Evaluation: Sequence Content

Good Illumina dataset

Poor Illumina dataset

Page 11: Quality Control of NGS Data

Evaluation: Duplication

Good Illumina dataset

Page 12: Quality Control of NGS Data

Evaluation: Duplication

Good Illumina dataset

Poor Illumina dataset

Page 13: Quality Control of NGS Data

Evaluation: Overrepresented Sequences

Good Illumina dataset

Page 14: Quality Control of NGS Data

Evaluation: Overrepresented Sequences

Good Illumina dataset

Poor Illumina dataset

Page 15: Quality Control of NGS Data

Evaluation: Kmer content

Good Illumina dataset

Page 16: Quality Control of NGS Data

Evaluation: Kmer content

Good Illumina dataset

Poor Illumina dataset

Page 17: Quality Control of NGS Data

Evaluation: Kmer content

454

Pacific Biosciences

Page 18: Quality Control of NGS Data

Evaluation

Exercise 2:

1. Type ‘fastqc’ to start the FastQC program. Load the four fastq sequence files in the program.

Question 2.2: How many sequences are there per file according to FastQC?

Question 2.3: What is the length range for these reads?

Question 2.4: What is the quality score range for these reads? Which file looks best quality-wise?

Question 2.5: Do these datasets have overrepresented reads?

Question 2.6: Looking at the kmer content, do you think that the samples contain an adapter?
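
For Exercise 2, FastQC can also be run without the GUI by passing the files on the command line, and the length range asked for in Question 2.3 can be cross-checked with awk. The following is a minimal sketch under a few assumptions: the files were renamed to .fastq, each record occupies four lines, and the names fastqc_reports and read_length_range are hypothetical.

```shell
# Hypothetical helper: print "min-max" read length for a fastq file.
# Assumes unwrapped four-line records, so line 2 of each record is the sequence.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (min == "" || n < min) min = n
             if (n > max) max = n
         }
         END { print min "-" max }' "$1"
}

# Non-interactive FastQC run; skipped when FastQC or the data are not present.
if command -v fastqc >/dev/null 2>&1 && [ -d breaker ] && [ -d immature_fruit ]; then
    mkdir -p fastqc_reports      # hypothetical output directory; must exist before the run
    fastqc -o fastqc_reports breaker/*.fastq immature_fruit/*.fastq
    for fq in breaker/*.fastq immature_fruit/*.fastq; do
        echo "$fq: read lengths $(read_length_range "$fq")"
    done
fi
```

The range printed by the helper should agree with the sequence length reported in the basic statistics section of each FastQC report.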

Page 19: Quality Control of NGS Data

Goal:

Trim the low-quality ends of the reads and remove the short reads.

Data: (Illumina data for two tomato ripening stages)

ch4_demo_dataset.tar.gz

Tools:

fastq-mcf (command line, tool to process reads)

FastQC (GUI, calculates several statistics for each file)

Preprocessing

Page 20: Quality Control of NGS Data

Exercise 3:

• Download the file adapters1.fa from ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa

• Run the read processing program over each of the datasets using:

• Min. qscore of 30

• Min. length of 40 bp

• Type ‘fastqc’ to start the FastQC program. Load the four new fastq sequence files. Compare the results with the previous datasets.

Preprocessing

Tip: Use ‘fastqc -h’ to see help
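
The trimming step of Exercise 3 could be sketched as below, using the fastq-mcf options for the quality threshold (-q), the minimum remaining length (-l) and the output file (-o). A qscore of 30 corresponds to an estimated error probability of 1 in 1000 per base. The sketch assumes adapters1.fa has already been downloaded into the current directory and that the input files end in .fastq; the helper name trim_reads and the .q30l40.fastq output suffix are hypothetical.

```shell
adapters=adapters1.fa    # assumed to be in the current directory already

# Hypothetical helper: trim one fastq file, or print the command when
# fastq-mcf or the input file is not available (dry run).
trim_reads() {
    in=$1
    out="${in%.fastq}.q30l40.fastq"     # hypothetical output naming
    if command -v fastq-mcf >/dev/null 2>&1 && [ -f "$in" ]; then
        # -q 30: trim bases below quality 30; -l 40: drop reads shorter than 40 bp
        fastq-mcf -q 30 -l 40 -o "$out" "$adapters" "$in"
    else
        echo "would run: fastq-mcf -q 30 -l 40 -o $out $adapters $in"
    fi
}

for fq in breaker/*.fastq immature_fruit/*.fastq; do
    [ -e "$fq" ] || continue            # glob matched nothing
    trim_reads "$fq"
done
```

The trimmed files can then be loaded into FastQC next to the originals to compare the per-base quality and length distribution plots.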

Page 21: Quality Control of NGS Data

Need Help??

Solutions: https://bitly.com/BioinfoInternExSol2014