[Figure: microarray analysis workflow. Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Normalization → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation]
Churchill, March 15
Bult, Lecture 5
Bult, Lecture 6
Hibbs, Lectures 10 and 11
Blake, Lecture 16 and 17
Project Steps
• Find and Download Array Data
• Normalize Array Data
• Analyze Data
– i.e., generate gene lists
• Differentially expressed genes, genes in clusters, etc.
• Interpret Gene Lists
– Use the annotations of genes in your lists
• Gene Ontology terms are available for many organisms, but not all
Getting The Data
• Search GEO (or whatever) for a data set of interest.
• Download the data files
– e.g., Affy .CEL files, Affy .CDF files, etc.
• Upload to home directory
Normalize the Data
• Sent you all a script (2/23/2012) to RMA normalize the Ackerman array data available from my home directory
– Measure pyrophosphate released by a nucleotide when it is added to a growing DNA chain
– Flow space describes sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
– Each base sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
– Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
• Line 1: begins with @; followed by sequence identifier and optional description
• Line 2: the sequence
• Line 3: begins with “+” and is followed by the sequence identifier and description (both are optional)
• Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
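The four-line record layout can be sketched as a small parser. This is a minimal illustration, not a full FASTQ reader: it assumes each record is exactly four unwrapped lines, and the names `FastqRecord` and `read_fastq` as well as the example read are made up for this sketch.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str   # text after "@" up to the first space
    description: str  # optional remainder of line 1
    sequence: str     # line 2
    quality: str      # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file, assuming 4 unwrapped lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must begin with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must begin with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        identifier, _, description = header[1:].partition(" ")
        yield FastqRecord(identifier, description, sequence, quality)


# Made-up record for illustration only (not taken from Cock et al.).
example = "@read1 illustrative read\nGATTACA\n+\nIIIIHHG\n"
for record in read_fastq(io.StringIO(example)):
    print(record.identifier, record.sequence, record.quality)
```

For real data, an established library reader is usually a safer choice than a hand-written parser, since some FASTQ files wrap the sequence and quality across several lines.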
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93.
Solexa quality scores have to be converted to PHRED quality scores.
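A minimal sketch of these conversions, following the encodings described in Cock et al.: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and Solexa stores Solexa-scaled scores with an offset of 64. The function names are made up for this sketch.

```python
import math

SANGER_OFFSET = 33    # Sanger FASTQ: PHRED score + 33
ILLUMINA_OFFSET = 64  # Solexa and Illumina 1.3+ FASTQ: score + 64


def solexa_to_phred(q_solexa: int) -> int:
    """Convert one Solexa score to the nearest PHRED score.

    Uses Q_PHRED = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (PHRED+64) as Sanger (PHRED+33)."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in quality)


def solexa_to_sanger(quality: str) -> str:
    """Re-encode a Solexa quality string (Solexa+64) as Sanger (PHRED+33)."""
    return "".join(
        chr(solexa_to_phred(ord(c) - ILLUMINA_OFFSET) + SANGER_OFFSET)
        for c in quality
    )


print(illumina13_to_sanger("hhhh"))  # PHRED 40 at each base, re-encoded as Sanger
```

The two score scales agree closely at high quality and differ mainly for low scores, which is why only the Solexa form needs the logarithmic conversion rather than a simple offset shift.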
SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited w/ header section and alignment section
• Header lines begin with @ (the header section is optional)
• Alignment section has 11 mandatory fields
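The layout of an alignment line can be sketched as follows. The 11 field names are those of the SAM specification; the class and function names and the example line are made up for this sketch, and the optional tags after the mandatory fields are kept as raw strings rather than decoded.

```python
from typing import NamedTuple, Optional, Tuple


class SamAlignment(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (PHRED+33)
    optional: Tuple[str, ...]  # any TAG:TYPE:VALUE fields, undecoded


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines (which begin with @)."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Made-up alignment line for illustration only.
example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
alignment = parse_sam_line(example)
print(alignment.rname, alignment.pos, alignment.cigar)
```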
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community
[Figure labels: Tools; History; Dialog/Parameter Selection]
Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml
Tutorial 5
RNA Seq Workflow
• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
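As one illustration of the quality control step, reads can be dropped when their mean quality score is low. This is only a sketch under stated assumptions: qualities are Sanger-encoded (PHRED+33), the cutoff of 20 is an arbitrary example value, and the function names are made up here. It is not how any particular QC tool works.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean PHRED score of a quality string (assumes PHRED+33 encoding)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield only reads whose mean quality is at least the cutoff.

    The default cutoff of 20 is an assumed example value, not a recommendation.
    """
    for identifier, sequence, quality in reads:
        if quality and mean_phred(quality) >= min_mean_quality:
            yield identifier, sequence, quality


# Made-up reads for illustration only.
reads = [
    ("good", "ACGT", "IIII"),  # PHRED 40 at every base
    ("poor", "ACGT", "!!!!"),  # PHRED 0 at every base
]
kept = list(filter_reads(reads))
print([identifier for identifier, _, _ in kept])
```

In practice, QC often also involves trimming low quality ends and removing adapter sequence rather than discarding whole reads.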
Typical RNA_Seq Project Work Flow
[Figure: workflow diagram. Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis]
JAX Computational Sciences Service
TopHat
Trapnell et al. (2009). Bioinformatics 25:1105-1111.
http://tophat.cbcb.umd.edu/
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
TopHat is better suited to aligning RNA Seq data than other aligners (Maq, BWA) because it takes splicing into account during the alignment process.
Trapnell C et al. Bioinformatics 2009;25:1105-1111
TopHat is built on the Bowtie alignment algorithm.
Cufflinks
Trapnell et al. (2010). Nature Biotechnology 28:511-515.
http://cufflinks.cbcb.umd.edu/
• Assembles transcripts,
• Estimates their abundances, and
• Tests for differential expression and regulation in RNA-Seq samples