
Exercise 1 Review - Cornell University (cbsu.tc.cornell.edu/lab/doc/RNASeq20140421lecture2.pdf)

Sep 06, 2019


dariahiddleston
Transcript
Page 1:

Exercise 1 Review

tophat -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

Running Tophat

-o A : write the result files into a directory “A”

-G testgenome.gff3: use the gene annotation file testgenome.gff3

--no-novel-juncs: do not detect novel splicing junctions

testgenome: name (prefix) of the bowtie2 index of the reference genome

a.fastq.gz: RNA-seq data file.

Page 2:

Tophat command:

tophat -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

• Default, no argument: detect novel junctions

• No novel junctions: -G myAnnotation.gtf --no-novel-juncs

• Doing both (use the annotation and also detect novel junctions): -G myAnnotation.gtf

Exercise 1 Review
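The three junction modes listed on this slide differ only in which options are passed. A minimal Python sketch, assuming a hypothetical helper named tophat_args that only assembles the argument list and never runs TopHat; the file names are the ones used on the slides:

```python
from typing import List, Optional


def tophat_args(index: str, reads: str, out_dir: str,
                annotation: Optional[str] = None,
                novel_junctions: bool = True,
                threads: int = 1) -> List[str]:
    """Assemble a tophat command line as an argument list (nothing is executed)."""
    # This sketch only models -G, so it rejects --no-novel-juncs
    # when no annotation file is supplied.
    if not novel_junctions and annotation is None:
        raise ValueError("--no-novel-juncs needs an annotation given with -G")
    args = ["tophat"]
    if threads > 1:
        args += ["-p", str(threads)]
    args += ["-o", out_dir]
    if annotation is not None:
        args += ["-G", annotation]
    if not novel_junctions:
        args.append("--no-novel-juncs")
    # Positional arguments: bowtie2 index prefix, then the read file.
    return args + [index, reads]


# Default: no annotation, novel junctions are detected.
default = tophat_args("testgenome", "a.fastq.gz", "A")
# Annotation only: no novel junctions.
no_novel = tophat_args("testgenome", "a.fastq.gz", "A",
                       annotation="testgenome.gff3", novel_junctions=False)
# Both: use the annotation and still look for novel junctions.
both = tophat_args("testgenome", "a.fastq.gz", "A",
                   annotation="testgenome.gff3")

for cmd in (default, no_novel, both):
    print(" ".join(cmd))
```

The second call reproduces the Exercise 1 command. A list built this way could be handed to subprocess.run, which is left out here on purpose.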

Page 3:

Exercise 1 Review

tophat -p 8 -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

Running Tophat using multiple threads:

-p 8: use 8 threads (8 cores)

Page 4:

Running Tophat on BioHPC Lab Computers

• General workstations

8-core, 16G RAM, 0.5T hard drive

1 job at a time, using “-p 8”

• Medium memory workstations

24-core, 128G RAM, 1T SSD drive, 4T hard drive

4 jobs (6 cores per job) at a time;

• Large memory workstations

64 cores; 512GB RAM; 9.4TB HDD; 1TB SSD;

10 jobs (6 cores per job) at a time;
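The job counts above follow from dividing the available cores by the cores given to each job. A small Python sketch of that arithmetic, assuming cores are the only limit (memory and disk are ignored, and the function name is made up for illustration):

```python
def concurrent_jobs(total_cores: int, cores_per_job: int) -> int:
    """Number of jobs that fit at once when each job gets cores_per_job cores."""
    if cores_per_job <= 0:
        raise ValueError("cores_per_job must be positive")
    return total_cores // cores_per_job


# (total cores, cores per job) for the three workstation classes on the slide.
workstations = {
    "general": (8, 8),
    "medium memory": (24, 6),
    "large memory": (64, 6),
}

for name, (cores, per_job) in workstations.items():
    print(name, concurrent_jobs(cores, per_job))
```

With 6 cores per job, 64 cores leave 4 cores unused, which is why the large memory machines run 10 jobs rather than 11.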

Page 5:

Genome Databases for Tophat

• In the /local_data directory:

human, mouse, Drosophila, C. elegans, yeast, Arabidopsis, maize.

• In /shared_data/genome_db/:

rice, grape, apple, older versions of databases.

Page 6:

How to prepare Tophat genome database

bowtie2-build rice7.fa rice7

rice7.fa: genome fasta file

rice7: give the database a name

Page 7:

TOPHAT command in 2-step approach

Step 1. Build transcript sequences using the genome fasta and GTF/GFF file

tophat -G myAnnotation.gtf --transcriptome-index transcriptome mygenome

Step 2. Use TOPHAT to align reads to the transcript and genome sequences

tophat -p 8 -o sample1 -G myAnnotation.gtf --transcriptome-index transcriptome/myAnnotation --no-novel-juncs mygenome sample1.fastq.gz

Page 8:

Default output from TOPHAT

• Up to 20 hits for the same read.

I. If more than 20 hits are found, randomly report 20;

II. It can be adjusted with the --max-multihits parameter.

• Maximum mismatch: 2

I. It can be adjusted with --read-mismatches
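The multi-hit cap described above can be pictured as random sampling once a read has more hits than the limit. A minimal Python sketch, assuming all hits of a read are equally good; report_hits is a hypothetical name and this is not TopHat's own code:

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def report_hits(hits: Sequence[T], max_multihits: int = 20,
                rng: Optional[random.Random] = None) -> List[T]:
    """Return all hits, or a random subset of max_multihits when there are more."""
    rng = rng or random.Random()
    if len(hits) <= max_multihits:
        return list(hits)
    # More hits than the cap: pick max_multihits of them without replacement.
    return rng.sample(list(hits), max_multihits)


few = report_hits(["chr1:100", "chr2:500"])
many = report_hits([f"chr1:{pos}" for pos in range(50)])
print(len(few), len(many))
```

Passing a seeded random.Random as rng makes the selection reproducible between runs.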

Page 9:

Quantification & Normalization

Page 10:

Summarizing mapped reads into a gene level count

Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.

Page 11:

RNA-seq Pipelines

Genome available, but no annotation:

Tophat: novel junctions

Cufflinks: build gff files

Cuffmerge: merge gff files

Cuffdiff: count reads using gff

Differentially expressed gene detection

Good genome and gene annotation:

Tophat: no novel junctions

Cuffdiff: count reads using gff

Differentially expressed gene detection

Page 12:

No reference genome

• Use Trinity to assemble a reference transcriptome;

• Use the tools provided with Trinity for transcript quantification;

• Contact us for detailed instructions for running Trinity.

Page 13:

Complications in quantification

1. Multi-mapped reads

Cufflinks

– by default, divide each multi-mapped read uniformly among all of its mapped positions

– multi-mapped read correction (off by default; can be enabled with the --multi-read-correct option)

HTSeq

– Count unique and multi-mapped reads separately
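Uniform division of multi-mapped reads can be illustrated with a few lines of Python. This is a sketch under simplifying assumptions: each read is represented only by the list of genes at its mapped positions, and fractional_counts is a hypothetical name, not a Cufflinks function:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def fractional_counts(read_hits: Iterable[List[str]]) -> Dict[str, float]:
    """Give each mapped position of a read an equal share (1/n) of that read."""
    counts: Dict[str, float] = defaultdict(float)
    for genes in read_hits:
        if not genes:
            continue  # unmapped read, nothing to distribute
        share = 1.0 / len(genes)
        for gene in genes:
            counts[gene] += share
    return dict(counts)


# Two uniquely mapped reads and two reads that map to both genes.
reads = [["geneA"], ["geneA", "geneB"], ["geneB"], ["geneA", "geneB"]]
counts = fractional_counts(reads)
print(counts)
```

Each gene ends up with one unique read plus two half reads, so the total still equals the number of mapped reads.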

Page 14:

Complications in quantification

2. Assign reads to isoforms

Cufflinks

– Use its own model to estimate isoform abundance;

HTSeq

– A set of arbitrary rules specified by the mode option, under which ambiguous reads are either (a) skipped or (b) counted towards each feature.

* This is less of a problem for gene-level read counting
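The two choices for ambiguous reads, skip or count towards each feature, can be sketched in Python as follows. This is a simplified illustration, not HTSeq's implementation: each read is reduced to the set of features it overlaps, the function and parameter names are hypothetical, and the special counter names mimic the ones htseq-count prints:

```python
from collections import Counter
from typing import Iterable, Set


def count_reads(read_features: Iterable[Set[str]],
                ambiguous: str = "skip") -> Counter:
    """Count reads per feature; ambiguous reads are skipped or counted for all."""
    if ambiguous not in ("skip", "all"):
        raise ValueError("ambiguous must be 'skip' or 'all'")
    counts: Counter = Counter()
    for features in read_features:
        if not features:
            counts["__no_feature"] += 1
        elif len(features) == 1:
            counts[next(iter(features))] += 1
        elif ambiguous == "skip":
            # (a) the read overlaps several features and is set aside
            counts["__ambiguous"] += 1
        else:
            # (b) the read is counted once towards every overlapped feature
            for feature in features:
                counts[feature] += 1
    return counts


reads = [{"geneA"}, {"geneA", "geneB"}, set(), {"geneB"}]
skipped = count_reads(reads, ambiguous="skip")
counted_all = count_reads(reads, ambiguous="all")
print(dict(skipped))
print(dict(counted_all))
```

Counting towards each feature inflates the total above the number of reads, which is one reason the choice of rule changes the count table.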

Page 15:

Included in the CUFFLINKS software: CUFFLINKS vs CUFFDIFF

• Cufflinks

– Single library;

– Reference-guided transcript assembly;

• Cuffdiff

– Multiple libraries;

– Quantification based on genes defined in GFF/GTF files;

– Identify differentially expressed genes

Page 16:

Included in the CUFFLINKS software: CUFFLINKS vs CUFFDIFF

• Cufflinks

– Input: BAM file, GFF/GTF (optional)

– Output: GTF, isoform and gene FPKM

• Cuffdiff

– Input: BAM files, GFF/GTF (required)

– Output (at either gene or isoform level)

• FPKM, read count, read group tracking.

• Differentially expressed genes

• Differential splicing test.

Page 17:

Included in the CUFFLINKS software: CUFFLINKS vs CUFFDIFF

• Cufflinks

– Input: BAM file, GFF/GTF (optional)

– Output: GTF, isoform and gene FPKM

• Cuffdiff

– Input: BAM files, GFF/GTF (required)

– Output (at either gene or isoform level)

• FPKM, read count, read group tracking.

• Differentially expressed genes

• Differential splicing test.

Commonly done at gene level

Raw counts can be used for external software

Page 18:

Read depth is not even across the same gene

• Sequencing bias;

• RNA-seq provides relative abundance for the same gene across samples.

Page 19:

Fragment bias correction in cufflinks

• In some scenarios, it can help to correct batch effects due to different platforms, protocols, reagents, etc.

--frag-bias-correct

Page 20:

Normalization methods

• Total-count normalization

– By total mapped reads

• Upper-quartile normalization

– By read count of the gene at the upper quartile

• Normalization by housekeeping genes

• Trimmed mean of M values (TMM) normalization

Page 21:

Normalization methods

• Total-count normalization (FPKM, RPKM)

– By total mapped reads

• Upper-quartile normalization

– By read count of the gene at the upper quartile

• Normalization by housekeeping genes

• Trimmed mean of M values (TMM) normalization

Slide labels: cuffdiff; EdgeR & DESeq; Default
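As a rough illustration of the first two methods in the list, the Python sketch below computes an FPKM value from a raw count and finds the upper-quartile count of a library. It uses only the standard library; the function names are hypothetical, the numbers are invented, and excluding zero-count genes before taking the quartile is an assumption of this sketch:

```python
import statistics
from typing import Sequence


def fpkm(count: float, length_bp: int, total_mapped: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_mapped)


def upper_quartile(counts: Sequence[float]) -> float:
    """75th percentile of the non-zero gene counts in one library."""
    nonzero = sorted(c for c in counts if c > 0)
    # quantiles(..., n=4) returns the three quartile cut points.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


# 500 fragments on a 2 kb gene in a library of 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
print(value)

library = [0, 10, 20, 30, 40, 50]
uq = upper_quartile(library)
# Dividing every count by the library's upper quartile puts libraries
# on a comparable scale.
scaled = [c / uq for c in library]
print(uq, scaled)
```

Total-count methods are sensitive to a few very highly expressed genes, which is the usual motivation for scaling by an upper quartile instead.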

Page 22:

Normalization for RNA-seq data

Robinson & Oshlack, Genome Biology 2010, 11:R25.

Figure: log intensity ratio (M) vs average log intensity (A), marking the median log-ratio of the housekeeping genes and the estimated TMM normalization factor.
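The M and A values on the plot can be computed per gene from two libraries, and a trimmed mean of M gives a scaling factor. The Python sketch below is a deliberately simplified version: it trims only on M and uses an unweighted mean, whereas the published TMM method also trims on A and weights each gene. Function names and the example counts are made up:

```python
import math
from typing import List, Sequence, Tuple


def ma_values(x: Sequence[float], y: Sequence[float]) -> List[Tuple[float, float]]:
    """Per-gene (M, A): log2 ratio and mean log2 of library-size-scaled counts."""
    nx, ny = sum(x), sum(y)
    pairs = []
    for a, b in zip(x, y):
        if a > 0 and b > 0:  # genes with a zero count have no defined log ratio
            pa, pb = a / nx, b / ny
            pairs.append((math.log2(pa / pb), 0.5 * math.log2(pa * pb)))
    return pairs


def simple_tmm_factor(x: Sequence[float], y: Sequence[float],
                      trim: float = 0.3) -> float:
    """2 ** (mean of M after dropping the lowest and highest `trim` fraction)."""
    m = sorted(m_val for m_val, _ in ma_values(x, y))
    k = int(len(m) * trim)
    kept = m[k:len(m) - k] or m  # fall back to all genes if trimming empties the list
    return 2 ** (sum(kept) / len(kept))


# Library y is library x sequenced twice as deep: same composition.
same = simple_tmm_factor([10, 20, 30, 40], [20, 40, 60, 80])

# One gene dominates library y, which depresses every other gene's proportion.
skewed = simple_tmm_factor([10, 20, 30, 40, 50], [10, 20, 30, 40, 5000])
print(same, skewed)
```

When composition is identical the factor is 1; when one gene dominates a library, the trimmed mean ignores that gene and the factor moves away from 1 to compensate.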

Page 23:

Statistical modeling of gene expression and test for differentially expressed genes

1. Variance among biological replicates is large and cannot be modelled with a Poisson distribution.

2. EdgeR and DESeq use related negative binomial (NB) models, which are increasingly widely used for detecting differentially expressed genes.

3. Each package has a built-in multiple-testing correction (e.g. Benjamini-Hochberg) to report the FDR.
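Two of the points above lend themselves to a short numeric sketch in Python: the negative binomial variance, which exceeds the Poisson variance by a dispersion term, and the Benjamini-Hochberg adjustment of p-values. The function names are hypothetical and the code is written from the textbook definitions, not taken from EdgeR or DESeq:

```python
from typing import List, Sequence


def nb_variance(mean: float, dispersion: float) -> float:
    """NB variance mean + dispersion * mean**2; dispersion 0 gives Poisson."""
    return mean + dispersion * mean ** 2


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# With mean 100, a dispersion of 0.1 gives a variance about 11 times the Poisson value.
print(nb_variance(100, 0.0), nb_variance(100, 0.1))

padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```

Genes whose adjusted p-value falls below a chosen threshold are then reported as differentially expressed at that FDR.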

Page 24:

Comparison of Methods

Rapaport F et al., Genome Biology 2013, 14:R95

Page 25:

RNA-seq Workflow at Bioinformatics Facility

TOPHAT -> BAM files

CUFFDIFF -> counts in genes.read_group_tracking

EdgeR -> TMM normalization

EdgeR -> GLM for DE gene detection

http://cbsu.tc.cornell.edu/lab/doc/rna_seq_draft_v7.pdf

Page 26:

MDS plot of the samples

• Check reproducibility from replicates, remove outliers;

• Check batch effects;

Page 27:

Commercial software resources at Cornell

• General RNA-seq data analysis

– Geneious

– DNASTAR

• Pathway analysis

– Ingenuity

• RNA-seq statistics and Pathway

– Genespring

• License information: http://www.biotech.cornell.edu/node/137

• BioHPC lab has one Windows computer that can run the commercial software

Page 28:

Exercise

• Use cuffdiff to identify differentially expressed genes of rice plants grown under two different conditions, A and B. There are two replicates for each condition.

• Use the EdgeR package to make an MDS plot of the 4 libraries.