RNA-Seq Transcriptome Profiling
Dec 13, 2015
Before we start: Align sequence reads to the reference genome
The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome.
Overview: This training module provides hands-on experience in using RNA-Seq for transcriptome profiling.
Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?
RNA-seq in the Discovery Environment
Scientific Objective
LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).
Mutations in the HY5 gene cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response.
We will use RNA-seq to compare the transcriptomes of seedlings from WT and hy5 genetic backgrounds to identify HY5-regulated genes.
Samples
• Experimental data downloaded from the NCBI Sequence Read Archive (GEO:GSM613465 and GEO:GSM613466)
• Two replicates each of RNA-seq runs for Wild-type and hy5 mutant seedlings.
Specific Objectives
By the end of this module, you should:
1) Be more familiar with the DE user interface
2) Understand the starting data for RNA-seq analysis
3) Be able to align short sequence reads with a reference genome in the DE
4) Be able to analyze differential gene expression in the DE
5) Be able to visualize RNA-Seq data in Atmosphere
RNA-Seq Conceptual Overview
Image source: http://www.bgisequence.com
RNA-Seq Data
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/
[email protected] HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:
[email protected] HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
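Each FASTQ record consists of four lines: an @-prefixed header, the base calls, a + separator, and one quality character per base. As a minimal sketch, assuming the Sanger encoding named earlier in this module (ASCII code minus 33 gives the Phred score), the records could be parsed and decoded roughly as follows. The names FastqRecord, read_fastq, and phred_scores are illustrative and do not belong to any tool used in this module.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str   # base calls
    quality: str    # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset 33 is the Sanger convention."""
    return [ord(char) - offset for char in quality]


# First read from the sample data; its quality string ends in 17 '#' characters.
example = io.StringIO(
    "@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41\n"
    "CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC\n"
    "+\n"
    + "BA?39AAA933BA05>A@A=?4,9" + "#" * 17 + "\n"
)
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
print(records[0].name, scores)
```

Under this encoding the trailing '#' characters decode to Phred 2, the lowest quality in this read, while the leading 'B' decodes to Phred 33.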
…Now What?
Bioinformagician
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
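The command list above is repetitive, and pairing the wrong mate file with a sample is an easy mistake when each line is typed by hand. As a hedged sketch, the per-sample command lines could instead be generated from the condition and replicate names. The helper names below are illustrative; the flags and file names are copied from the commands above, and nothing is executed here.

```python
from typing import List

CONDITIONS = ["C1", "C2"]   # condition labels used in the commands above
REPLICATES = [1, 2, 3]      # replicate numbers per condition
THREADS = "8"               # value passed to -p


def tophat_command(condition: str, replicate: int) -> List[str]:
    """Build one TopHat invocation; both mate files share the sample prefix."""
    sample = f"{condition}_R{replicate}"
    return [
        "tophat", "-p", THREADS, "-G", "genes.gtf",
        "-o", f"{sample}_thout", "genome",
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]


def cufflinks_command(condition: str, replicate: int) -> List[str]:
    """Build one Cufflinks invocation on the matching TopHat output."""
    sample = f"{condition}_R{replicate}"
    return [
        "cufflinks", "-p", THREADS,
        "-o", f"{sample}_clout", f"{sample}_thout/accepted_hits.bam",
    ]


def cuffdiff_command() -> List[str]:
    """Build the Cuffdiff invocation: one comma-joined BAM list per condition."""
    bam_lists = [
        ",".join(f"./{c}_R{r}_thout/accepted_hits.bam" for r in REPLICATES)
        for c in CONDITIONS
    ]
    return [
        "cuffdiff", "-o", "diff_out", "-b", "genome.fa", "-p", THREADS,
        "-L", ",".join(CONDITIONS), "-u", "merged_asm/merged.gtf", *bam_lists,
    ]


# Print the commands for review instead of running them.
for condition in CONDITIONS:
    for replicate in REPLICATES:
        print(" ".join(tophat_command(condition, replicate)))
        print(" ".join(cufflinks_command(condition, replicate)))
print(" ".join(cuffdiff_command()))
```

Each helper returns an argument list, so a command could later be passed to subprocess.run(command, check=True) on a machine where the tools are installed.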
Your RNA-Seq Data
Your transformed RNA-Seq Data
RNA-Seq Analysis Workflow
TopHat (Bowtie)
Cufflinks
Cuffmerge
Cuffdiff
CummeRbund
Your Data
iPlant Data Store
FASTQ
Discovery Environment
Atmosphere
Quick Summary
Find Differentially Expressed genes
Align to Genome: TopHat
View Alignments: IGV
Differential Expression: CuffDiff
Download Reads from SRA
Export Reads to FASTQ
Import SRA data from NCBI SRA
Extract FASTQ files from the downloaded SRA archives
Pre-Configured: Getting the RNA-seq Data
Examining Data Quality with FastQC
RNA-Seq Workflow Overview
Align the four FASTQ files to the Arabidopsis genome using TopHat
Align Reads to the Genome
TopHat
• TopHat is one of many applications for aligning short sequence reads to a reference genome.
• It uses the Bowtie aligner internally.
• Alternatives include BWA, MAQ, OLego, Stampy, and Novoalign.
RNA-seq Sample Read Statistics
• Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).
• Reads retained by TopHat are shown below:
Sequence run   WT-1         WT-2         hy5-1        hy5-2
Reads          10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)   445.5        421.3        549.8        511.3
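The Seq. (Mbase) row follows from the read counts if every read is 41 bases long, as the length=41 field in the FASTQ headers suggests. A quick check under that assumption (the function name megabases is illustrative):

```python
READ_LENGTH = 41  # assumed from the 'length=41' field in the FASTQ headers

# Read counts per sequence run, taken from the table above.
reads_per_run = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}


def megabases(read_count: int, read_length: int = READ_LENGTH) -> float:
    """Total sequence in Mbase, rounded to one decimal place."""
    return round(read_count * read_length / 1_000_000, 1)


mbases = {run: megabases(count) for run, count in reads_per_run.items()}
print(mbases)
```

With a fixed read length of 41 this reproduces the four Mbase values in the table.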
ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare with the gene on the right, which lacks differential expression.
CuffDiff
• Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
• CuffDiff is a program within the Cufflinks package that compares transcript abundance between samples.
Examining Differential Gene Expression
Examining the Gene Expression Data
Filter CuffDiff results for up or down-regulated gene expression in hy5 seedlings
Differentially expressed genes
Example filtered CuffDiff results generated with the Filter_CuffDiff_Results app to:
1) Select genes with a minimum two-fold expression difference
2) Select genes with significant differential expression (q <= 0.05)
3) Add gene descriptions
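The first two filtering criteria can be illustrated with a short sketch. This is not the implementation of Filter_CuffDiff_Results. It assumes a tab-delimited CuffDiff table that includes status, log2(fold_change), and q_value columns, and it treats a two-fold difference as an absolute log2 fold change of at least 1. The function name and the demo rows are made up for illustration and are not real results.

```python
import csv
import io
import math
from typing import Dict, Iterator, TextIO


def filter_cuffdiff(
    handle: TextIO,
    min_abs_log2_fc: float = 1.0,  # log2(2) = 1, i.e. at least two-fold
    max_q: float = 0.05,
) -> Iterator[Dict[str, str]]:
    """Yield rows that pass the fold-change and q-value thresholds."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip genes that were not successfully tested
        log2_fc = float(row["log2(fold_change)"])  # also parses 'inf' and '-inf'
        q_value = float(row["q_value"])
        if math.isnan(log2_fc):
            continue
        if abs(log2_fc) >= min_abs_log2_fc and q_value <= max_q:
            yield row


# Made-up demo table with only the columns the filter needs.
demo = io.StringIO(
    "gene_id\tstatus\tlog2(fold_change)\tq_value\n"
    "GENE_A\tOK\t-3.2\t0.001\n"    # passes both thresholds
    "GENE_B\tOK\t0.4\t0.001\n"     # fold change too small
    "GENE_C\tOK\t2.0\t0.3\n"       # not significant
    "GENE_D\tNOTEST\t5.0\t0.001\n" # not tested
)
kept = [row["gene_id"] for row in filter_cuffdiff(demo)]
print(kept)
```

Using the absolute value keeps both up- and down-regulated genes; splitting on the sign of the log2 fold change would separate the two groups.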