Exercise 1 Review - Cornell University · cbsu.tc.cornell.edu/lab/doc/RNASeq20140421lecture2.pdf

Exercise 1 Review

tophat -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

Running Tophat

-o A : write the result files into a directory “A”

-G testgenome.gff3: use the gene annotation file testgenome.gff3

--no-novel-juncs: do not detect novel splicing junctions

testgenome: base name of the bowtie2 index of the reference genome

a.fastq.gz: RNA-seq data file.

Tophat command:

tophat -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

• Default (no extra arguments): detect novel junctions
• Known junctions only: -G myAnnotation.gtf --no-novel-juncs
• Both known and novel junctions: -G myAnnotation.gtf
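The three junction modes differ only in whether -G and --no-novel-juncs are present. A minimal Python sketch that assembles the argument list for each mode is shown here; the helper name `tophat_command` and its parameters are hypothetical, and only flags that appear on these slides are used. It builds the command without running it.

```python
def tophat_command(index, reads, out_dir, annotation=None,
                   novel_junctions=True, threads=1):
    """Return a tophat argument list for one of the three junction modes."""
    # Design choice of this sketch: refuse --no-novel-juncs without an
    # annotation, since there would be no known junctions to restrict to.
    if not novel_junctions and annotation is None:
        raise ValueError("--no-novel-juncs requires an annotation file (-G)")
    cmd = ["tophat"]
    if threads > 1:
        cmd += ["-p", str(threads)]
    cmd += ["-o", out_dir]
    if annotation is not None:
        cmd += ["-G", annotation]
    if not novel_junctions:
        cmd.append("--no-novel-juncs")
    cmd += [index, reads]
    return cmd


# Default: novel junction detection only.
default_mode = tophat_command("testgenome", "a.fastq.gz", "A")
# Known junctions only.
known_only = tophat_command("testgenome", "a.fastq.gz", "A",
                            annotation="testgenome.gff3",
                            novel_junctions=False)
# Known junctions plus novel junction detection.
both = tophat_command("testgenome", "a.fastq.gz", "A",
                      annotation="testgenome.gff3")

for cmd in (default_mode, known_only, both):
    print(" ".join(cmd))
```

The returned list could be passed to `subprocess.run` on a machine where tophat is installed.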

Exercise 1 Review

tophat -p 8 -o A -G testgenome.gff3 --no-novel-juncs testgenome a.fastq.gz

Running Tophat using multiple threads:

-p 8: using 8 cores

Running Tophat on BioHPC Lab Computers

• General workstations
8 cores; 16 GB RAM; 0.5 TB hard drive
1 job at a time, using “-p 8”

• Medium memory workstations
24 cores; 128 GB RAM; 1 TB SSD; 4 TB hard drive
4 jobs (6 cores per job) at a time

• Large memory workstations
64 cores; 512 GB RAM; 9.4 TB hard drive; 1 TB SSD
10 jobs (6 cores per job) at a time
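The job counts above are consistent with dividing the core count by the threads given to each job. A small Python sketch of that arithmetic follows; the dictionary layout and the function name `concurrent_jobs` are illustrative, and memory or disk I/O may impose tighter limits that this ignores.

```python
# Core counts and per-job thread counts taken from the list above.
WORKSTATIONS = {
    "general": {"cores": 8, "threads_per_job": 8},
    "medium memory": {"cores": 24, "threads_per_job": 6},
    "large memory": {"cores": 64, "threads_per_job": 6},
}


def concurrent_jobs(cores, threads_per_job):
    """Number of jobs that fit if each job uses threads_per_job cores."""
    return cores // threads_per_job


for name, spec in WORKSTATIONS.items():
    jobs = concurrent_jobs(**spec)
    print(f"{name}: {jobs} job(s) at a time with -p {spec['threads_per_job']}")
```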

Genome Databases for Tophat

• On /local_data directory:
human, mouse, Drosophila, C. elegans, yeast, Arabidopsis, maize.

• On /shared_data/genome_db/:
rice, grape, apple, older versions of databases.

How to prepare Tophat genome database

bowtie2-build rice7.fa rice7

rice7.fa: genome fasta file
rice7: name given to the database

TOPHAT command in 2-step approach

Step 1. Build transcript sequences using the genome fasta and GFF/GTF file

tophat -G myAnnotation.gtf --transcriptome-index transcriptome/myAnnotation mygenome

Step 2. Use TOPHAT to align reads to the transcript and genome sequences

tophat -p 8 -o sample1 -G myAnnotation.gtf --transcriptome-index transcriptome/myAnnotation --no-novel-juncs mygenome sample1.fastq.gz

Default output from TOPHAT

• Up to 20 hits for the same read.
I. If more than 20 hits are found, randomly report 20;
II. It can be adjusted with the --max-multihits parameter.

• Maximum mismatches per read: 2
I. It can be adjusted with the --read-mismatches parameter.

Quantification & Normalization

Summarizing mapped reads into a gene level count

Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.

RNA-seq Pipelines

Good genome and gene annotation
• Tophat: no novel junctions
• Cuffdiff: count reads using gff
• Differentially expressed gene detection

Genome available, but no annotation
• Tophat: novel junctions
• Cufflinks: build gff files
• Cuffmerge: merge gff files
• Cuffdiff: count reads using gff
• Differentially expressed gene detection

No reference genome

• Using Trinity to assemble a reference transcriptome;
• Using tools provided with Trinity for transcript quantification;
• Contact us for detailed instructions for running Trinity.

Complications in quantification

1. Multi-mapped reads

Cufflinks

– Uniformly divide each read among all mapped positions

– Multi-mapped read correction (off by default, can be enabled with the --multi-read-correct option)

HTSeq

– Count unique and multi-mapped reads separately
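The two strategies can be contrasted on a toy example. The Python sketch below is a simplification under stated assumptions: the read IDs, gene names and alignments are made up, each hit is reduced to the gene it falls in, and neither function reproduces the actual Cufflinks or HTSeq implementation (the probabilistic weighting behind --multi-read-correct is not shown).

```python
from collections import defaultdict

# Hypothetical alignments: read ID -> gene hit at each mapped position.
alignments = {
    "read1": ["geneA"],
    "read2": ["geneA", "geneB"],
    "read3": ["geneB"],
    "read4": ["geneA", "geneB", "geneC", "geneC"],
}


def uniform_split(alignments):
    """Give each mapped position an equal 1/n share of the read."""
    counts = defaultdict(float)
    for hits in alignments.values():
        share = 1.0 / len(hits)
        for gene in hits:
            counts[gene] += share
    return dict(counts)


def unique_and_multi(alignments):
    """Tally uniquely mapped and multi-mapped reads in separate tables."""
    unique, multi = defaultdict(int), defaultdict(int)
    for hits in alignments.values():
        table = unique if len(hits) == 1 else multi
        for gene in set(hits):  # count a read once per gene
            table[gene] += 1
    return dict(unique), dict(multi)


print(uniform_split(alignments))
print(unique_and_multi(alignments))
```

With uniform division, geneC receives half a read from read4 even though no read maps to it uniquely, which is the kind of difference the choice of strategy introduces into the count table.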

Complications in quantification

2. Assign reads to isoforms

Cufflinks

– Use its own model to estimate isoform abundance;

HTSeq

– A set of arbitrary rules specified by the mode option: an ambiguous read is either (a) skipped or (b) counted towards each feature.

* This is less of a problem for gene-level read counting
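A toy illustration of rules (a) and (b), and of why gene-level counting is less affected, is sketched below in Python. The function `count_reads`, its `ambiguous` argument and the transcript and gene names are hypothetical and do not mirror HTSeq's real options or code.

```python
from collections import Counter


def count_reads(read_features, ambiguous="skip"):
    """Count reads per feature.

    read_features: one set of overlapping feature names per read.
    ambiguous: "skip" drops reads overlapping several features (rule a),
               "each" counts them towards every feature (rule b).
    """
    if ambiguous not in ("skip", "each"):
        raise ValueError("ambiguous must be 'skip' or 'each'")
    counts = Counter()
    for features in read_features:
        if len(features) == 1 or (features and ambiguous == "each"):
            counts.update(features)
    return dict(counts)


# Four reads over two isoforms of the same gene (made-up data).
reads = [{"tx1"}, {"tx1", "tx2"}, {"tx2"}, {"tx1", "tx2"}]
print(count_reads(reads, "skip"))   # isoform level, rule (a)
print(count_reads(reads, "each"))   # isoform level, rule (b)

# Collapsing isoforms to their gene removes the ambiguity.
tx_to_gene = {"tx1": "g1", "tx2": "g1"}
gene_level = [{tx_to_gene[t] for t in feats} for feats in reads]
print(count_reads(gene_level, "skip"))
```

At isoform level the two rules give different tables, while at gene level all four reads are counted once for g1 under either rule.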

Included in the CUFFLINKS software: CUFFLINKS vs CUFFDIFF

• Cufflinks

– Single library;

– Reference-guided transcript assembly;

• Cuffdiff

– Multiple libraries;

– Quantification based on genes defined in GFF/GTF files;

– Identify differentially expressed genes

• Cufflinks

– Input: BAM file, GFF/GTF(optional)

– Output: GTF, isoform and gene FPKM

• Cuffdiff

– Input: BAM files, GFF/GTF(required)

– Output (at either gene or isoform level)

• FPKM, read count, read group tracking.

• Differentially expressed genes

• Differential splicing test.



Notes: analysis is commonly done at gene level; raw read counts can be used with external software.



Read depth is not even across the same gene

• Sequencing bias;

• RNA-seq provides relative abundance for the same gene across samples.

Fragment bias correction in cufflinks

• In some scenarios, it can help to correct batch effects due to different platforms, protocols, reagents, etc.

• Enabled with the --frag-bias-correct option

Normalization methods

• Total-count normalization (FPKM, RPKM)
– By total mapped reads

• Upper-quantile normalization
– By read count of gene at upper-quantile

• Normalization by housekeeping genes

• Trimmed mean (TMM) normalization

Slide callouts: cuffdiff; EdgeR & DESeq; Default
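The first two methods in the list can be written out in a few lines. The Python sketch below is illustrative only: the function names and counts are made up, FPKM is computed with the usual formula (fragments per kilobase of transcript per million mapped fragments, so it also divides by gene length), and the upper quartile is taken over genes with nonzero counts using one particular quantile definition, which may differ in detail from what a given package does.

```python
from statistics import quantiles


def fpkm(fragments, gene_length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped)


def upper_quartile(counts):
    """75th percentile of the nonzero gene counts in one library."""
    expressed = [c for c in counts.values() if c > 0]
    return quantiles(expressed, n=4, method="inclusive")[2]


def scale_by_upper_quartile(counts):
    """Divide every count by the library's upper-quartile count."""
    uq = upper_quartile(counts)
    return {gene: c / uq for gene, c in counts.items()}


# 500 fragments on a 2 kb gene in a library of 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))

# Hypothetical counts for one library.
library = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50, "f": 0}
print(upper_quartile(library))
print(scale_by_upper_quartile(library))
```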

Normalization for RNA-seq data

Robinson & Oshlack, Genome Biology 2010, 11:R25.

[Figure: log intensity ratio (M) vs average log intensity (A), marking the median log-ratio of the housekeeping genes and the estimated TMM normalization factor]
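The M and A values on such a plot are computed per gene from library-size-scaled counts, and the TMM factor is derived from a trimmed mean of M. The Python sketch below is a deliberately simplified version: it trims only on M and uses an unweighted mean, whereas the published method also trims on A and weights genes by precision. Function names and counts are hypothetical.

```python
import math


def ma_values(sample, reference):
    """Per-gene (M, A) from counts scaled by each library's total."""
    n_s, n_r = sum(sample.values()), sum(reference.values())
    result = {}
    for gene, y_s in sample.items():
        y_r = reference.get(gene, 0)
        if y_s == 0 or y_r == 0:
            continue  # log ratio undefined for zero counts
        p_s, p_r = y_s / n_s, y_r / n_r
        result[gene] = (math.log2(p_s / p_r), 0.5 * math.log2(p_s * p_r))
    return result


def trimmed_mean_factor(sample, reference, trim=0.3):
    """Simplified TMM: 2 ** (unweighted mean of M after trimming both tails)."""
    m = sorted(mv for mv, _ in ma_values(sample, reference).values())
    k = int(len(m) * trim)
    kept = m[k:len(m) - k] or m
    return 2 ** (sum(kept) / len(kept))


# One highly expressed gene (g5) takes up most reads in the sample.
reference = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
sample = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 600}
print(trimmed_mean_factor(sample, reference))
```

In this example g5 is trimmed away as an extreme M value, and the remaining genes indicate that the sample's library size should be scaled down to be comparable with the reference.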

Statistical modeling of gene expression and test for differentially expressed genes

1. Variance between biological replicates is large and cannot be modelled with a Poisson distribution.

2. EdgeR and DESeq use related negative binomial (NB) models, which are increasingly widely used for detecting differentially expressed genes.

3. Each software has a built-in multiple-testing correction (e.g. Benjamini-Hochberg) to report the FDR.
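For reference, the Benjamini-Hochberg adjustment mentioned in point 3 can be sketched in a few lines of Python. This is a generic textbook version with made-up p-values, not code taken from EdgeR, DESeq or cuffdiff.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over this and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(pvals))
```

Genes whose adjusted value falls below the chosen FDR threshold would be reported as differentially expressed.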

Comparison of Methods

Rapaport F et al., Genome Biology 2013, 14:R95

RNA-seq Workflow at Bioinformatics Facility

TOPHAT -> BAM files

CUFFDIFF -> counts in genes.read_group_tracking

EdgeR -> TMM normalization

EdgeR -> GLM for DE gene detection

http://cbsu.tc.cornell.edu/lab/doc/rna_seq_draft_v7.pdf
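The hand-off from CUFFDIFF to EdgeR requires turning genes.read_group_tracking into a gene-by-library count matrix. The Python sketch below shows one way to do this under explicit assumptions: the file is taken to be tab-delimited with columns named tracking_id, condition, replicate and raw_frags (check these against an actual cuffdiff output), and the rows here are made up. The function name `count_matrix` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up excerpt; a real file has more columns and many more genes.
HEADER = ["tracking_id", "condition", "replicate", "raw_frags", "FPKM"]
ROWS = [
    ("gene1", "A", "0", "120", "10.5"),
    ("gene1", "A", "1", "98", "9.1"),
    ("gene1", "B", "0", "240", "20.2"),
    ("gene1", "B", "1", "260", "22.0"),
    ("gene2", "A", "0", "15", "1.3"),
    ("gene2", "A", "1", "12", "1.1"),
    ("gene2", "B", "0", "14", "1.2"),
    ("gene2", "B", "1", "16", "1.4"),
]
EXAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def count_matrix(handle, column="raw_frags"):
    """Pivot tracking rows into {gene: {condition_replicate: count}}."""
    matrix = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        library = f"{row['condition']}_{row['replicate']}"
        matrix[row["tracking_id"]][library] = round(float(row[column]))
    return dict(matrix)


matrix = count_matrix(io.StringIO(EXAMPLE))
for gene, counts in matrix.items():
    print(gene, counts)
```

Written out as a tab-delimited table, a matrix like this is the kind of raw-count input that count-based packages expect.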

MDS plot of the samples

• Check reproducibility from replicates, remove outliers;

• Check batch effects;

Commercial software resources at Cornell

• General RNA-seq data analysis

– Geneious

– DNASTAR

• Pathway analysis

– Ingenuity

• RNA-seq statistics and Pathway

– Genespring

• License information: http://www.biotech.cornell.edu/node/137

• BioHPC lab has one Windows computer that can run the commercial software

Exercise

• Use cuffdiff to identify differentially expressed genes of rice plants grown under two different conditions, A and B. There are two replicates for each condition.

• Use the EdgeR package to make an MDS plot of the 4 libraries.
