RNA-Seq Transcriptome Profiling

Dec 13, 2015

Page 1: RNA-Seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is doing the.

RNA-Seq Transcriptome Profiling


Before we start: Align sequence reads to the reference genome

The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome.


Overview: This training module is designed to provide hands-on experience in using RNA-Seq for transcriptome profiling.

Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

RNA-seq in the Discovery Environment


Scientific Objective

LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).

Mutations in the HY5 gene cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response.

We will use RNA-seq to compare the transcriptomes of seedlings from WT and hy5 genetic backgrounds to identify HY5-regulated genes.


Samples

• Experimental data were downloaded from the NCBI Sequence Read Archive (GEO:GSM613465 and GEO:GSM613466).

• Two replicates each of RNA-seq runs for wild-type and hy5 mutant seedlings.


Specific Objectives

By the end of this module, you should:

1) Be more familiar with the DE user interface

2) Understand the starting data for RNA-seq analysis

3) Be able to align short sequence reads with a reference genome in the DE

4) Be able to analyze differential gene expression in the DE

5) Be able to visualize RNA-Seq data in Atmosphere


RNA-Seq Conceptual Overview

Image source: http://www.bgisequence.com


RNA-Seq Data

@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/
[email protected] HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:
[email protected] HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
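The reads above are in Sanger FASTQ format: each read takes four lines (an `@` header, the sequence, a `+` separator, and one quality character per base), and each quality character encodes a Phred score as its ASCII code minus 33. The following is a minimal sketch of reading such records, not part of the DE workflow; the function name `parse_fastq` is made up for illustration, and the example uses only the first read shown above.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record.

    Blank lines are skipped; a trailing incomplete record is ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        read_id = header[1:].split()[0]
        # Sanger encoding: Phred score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in qual]
        yield read_id, seq, scores


# First read from the slide; the run of 17 '#' characters is written as a product
# so the quality string is exactly 41 characters long.
EXAMPLE = "\n".join([
    "@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41",
    "CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC",
    "+",
    "BA?39AAA933BA05>A@A=?4,9" + "#" * 17,
])

records = list(parse_fastq(EXAMPLE))
for read_id, seq, scores in records:
    print(read_id, len(seq), "bases; min Phred", min(scores), "max Phred", max(scores))
```

The long run of `#` characters at the end of this read corresponds to Phred 2, i.e. very low-confidence base calls, which is the kind of pattern a quality check is meant to reveal.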

…Now What?


Bioinformagician


$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,\
./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
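Typing one command per replicate by hand invites copy-and-paste slips, such as pairing mate files from different samples. One possible way to avoid this is to generate the command lines from the sample names. The sketch below only builds and prints the commands; it does not run them. It uses the same flags and file-naming pattern as the commands on this slide, and the helper names (`tophat_cmd`, `cufflinks_cmd`, `cuffdiff_cmd`) are made up for illustration.

```python
from typing import Dict, List

THREADS = 8  # same value as the -p option used on the slide

# Condition label -> replicate sample names (naming pattern taken from the slide)
CONDITIONS: Dict[str, List[str]] = {
    "C1": ["C1_R1", "C1_R2", "C1_R3"],
    "C2": ["C2_R1", "C2_R2", "C2_R3"],
}


def tophat_cmd(sample: str) -> List[str]:
    """Both mate files are derived from the same sample name."""
    return ["tophat", "-p", str(THREADS), "-G", "genes.gtf",
            "-o", f"{sample}_thout", "genome",
            f"{sample}_1.fq", f"{sample}_2.fq"]


def cufflinks_cmd(sample: str) -> List[str]:
    """Assemble transcripts from the TopHat alignments of one sample."""
    return ["cufflinks", "-p", str(THREADS), "-o", f"{sample}_clout",
            f"{sample}_thout/accepted_hits.bam"]


def cuffdiff_cmd(conditions: Dict[str, List[str]]) -> List[str]:
    """One comma-separated BAM list per condition, in label order."""
    labels = ",".join(conditions)
    bam_groups = [
        ",".join(f"./{s}_thout/accepted_hits.bam" for s in samples)
        for samples in conditions.values()
    ]
    return ["cuffdiff", "-o", "diff_out", "-b", "genome.fa",
            "-p", str(THREADS), "-L", labels,
            "-u", "merged_asm/merged.gtf", *bam_groups]


for samples in CONDITIONS.values():
    for sample in samples:
        print(" ".join(tophat_cmd(sample)))
        print(" ".join(cufflinks_cmd(sample)))
print(" ".join(cuffdiff_cmd(CONDITIONS)))
```

Returning each command as a list of arguments, rather than one string, means it could later be handed to `subprocess.run` without shell quoting concerns.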

Your RNA-Seq Data

Your transformed RNA-Seq Data


RNA-Seq Analysis Workflow

TopHat (Bowtie)

Cufflinks

Cuffmerge

Cuffdiff

CummeRbund

Your Data

iPlant Data Store

FASTQ

Discovery Environment

Atmosphere


Quick Summary

Find Differentially Expressed genes

Align to Genome: TopHat

View Alignments: IGV

Differential Expression: CuffDiff

Download Reads from SRA

Export Reads to FASTQ


Import SRA data from NCBI SRA

Extract FASTQ files from the downloaded SRA archives

Pre-Configured: Getting the RNA-seq Data


Examining Data Quality with FastQC



RNA-Seq Workflow Overview


Align the four FASTQ files to the Arabidopsis genome using TopHat

Align Reads to the Genome


TopHat

• TopHat is one of many applications for aligning short sequence reads to a reference genome.

• It uses the Bowtie aligner internally.

• Alternatives include BWA, MAQ, OLego, Stampy, Novoalign, etc.


RNA-seq Sample Read Statistics

• Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).

• Reads retained by TopHat are shown below:

Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3
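The Seq. (Mbase) row is consistent with the read counts multiplied by the 41-base read length shown in the FASTQ headers (length=41). A minimal sketch of that arithmetic, assuming every retained read is a full 41 bases long (the function name `megabases` is made up for illustration):

```python
READ_LENGTH = 41  # bases per read, from the "length=41" FASTQ headers

# Retained read counts from the table
READS = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}


def megabases(n_reads: int, read_length: int = READ_LENGTH) -> float:
    """Total sequence in millions of bases, assuming fixed-length reads."""
    return n_reads * read_length / 1_000_000


for run, n_reads in READS.items():
    print(f"{run}: {megabases(n_reads):.1f} Mbase")
```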


ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant background (>9-fold, p = 0). Compare with the gene on the right, which shows no differential expression.


RNA-Seq Workflow Overview


CuffDiff

• Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

• CuffDiff is a program within the Cufflinks package that compares transcript abundance between samples.


Examining Differential Gene Expression


Examining the Gene Expression Data


Filter CuffDiff results for up- or down-regulated gene expression in hy5 seedlings

Differentially expressed genes


Example filtered CuffDiff results generated with Filter_CuffDiff_Results to:

1) Select genes with a minimum two-fold expression difference

2) Select genes with significant differential expression (q <= 0.05)

3) Add gene descriptions
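The first two filtering criteria can be sketched in a few lines. This is not the implementation of Filter_CuffDiff_Results; it is an illustration that assumes a tab-delimited table with the column names Cuffdiff writes to its gene-level output (`status`, `log2(fold_change)`, `q_value`). A two-fold difference in either direction corresponds to an absolute log2 fold change of at least 1. The function name `filter_cuffdiff`, the gene names, and all values in the example rows are made up.

```python
import csv
import io

COLUMNS = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
           "status", "value_1", "value_2", "log2(fold_change)",
           "test_stat", "p_value", "q_value", "significant"]


def filter_cuffdiff(handle, min_abs_log2_fc=1.0, max_q=0.05):
    """Keep successfully tested genes with >= two-fold change and q <= max_q."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip genes that could not be tested
        log2_fc = float(row["log2(fold_change)"])
        q_value = float(row["q_value"])
        if abs(log2_fc) >= min_abs_log2_fc and q_value <= max_q:
            kept.append(row)
    return kept


# Made-up example rows (hypothetical genes and values, for illustration only)
ROWS = [
    ["geneA", "geneA", "-", "1:100-900", "WT", "hy5", "OK",
     "100", "10", "-3.32193", "-5.1", "0.00005", "0.001", "yes"],
    ["geneB", "geneB", "-", "1:2000-2900", "WT", "hy5", "OK",
     "50", "55", "0.1375", "0.3", "0.7", "0.8", "no"],
    ["geneC", "geneC", "-", "2:100-900", "WT", "hy5", "OK",
     "10", "25", "1.32193", "1.2", "0.1", "0.2", "no"],
    ["geneD", "geneD", "-", "2:2000-2900", "WT", "hy5", "NOTEST",
     "0", "0", "0", "0", "1", "1", "no"],
    ["geneE", "geneE", "-", "3:100-900", "WT", "hy5", "OK",
     "5", "40", "3", "4.8", "0.0004", "0.01", "yes"],
]

table = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS)
kept = filter_cuffdiff(io.StringIO(table))
for row in kept:
    print(row["gene_id"], row["log2(fold_change)"], row["q_value"])
```

Here geneC changes more than two-fold but is dropped because its q-value exceeds 0.05, and geneD is dropped because it was not tested. Adding gene descriptions (the third criterion) would require a separate annotation table and is not shown.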