8/3/2019 Micro Array Data Analysis 06
Microarray Data Analysis
The Bioinformatics side of the bench
The anatomy of your data files
from Affymetrix array analysis
.DAT = image file (~10^7 pixels)
.CEL = measured cell intensities
.CDF = cell description files (identify probe sets and probe set pairs)
.CHP = calculated probe set data
.RPT = report generated from .CHP
Quality Control (QC) of the chip: visual inspection
Look at the .DAT file or the .CHP file image
Scratches? Spots?
Corners and outside border: checkerboard appearance (B2 oligo)
Positive hybridization control
Used by software to place grid over image
Array name is written out in oligos!
Chip defects
Internal controls
B. subtilis genes (added poly-A tails)
Assessment of quality of sample preparation
Also as hybridization controls
Hybridization controls (bioB, bioC, bioD, cre)
E. coli and P1 bacteriophage biotin-labeled cRNAs
Spiked into the hybridization cocktail
Assess hybridization efficiency
Actin and GAPDH assess RNA sample/assay quality
Compare signal values from the 3' end to signal values from the 5' end
ratio generally should not exceed 3
Percent genes present (%P)
Replicate samples - similar %P values
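The two numeric checks above (3'/5' ratio and %P) can be sketched in a few lines. This is only an illustration: the function names and the signal values are hypothetical, and the detection calls are assumed to be available as a list of "P", "M" and "A" strings.

```python
def three_to_five_ratio(signal_3prime: float, signal_5prime: float) -> float:
    """Ratio of the 3' probe set signal to the 5' probe set signal."""
    if signal_5prime <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3prime / signal_5prime


def percent_present(calls: list[str]) -> float:
    """Percentage of probe sets with a Present ('P') detection call."""
    if not calls:
        raise ValueError("no detection calls supplied")
    return 100.0 * sum(c == "P" for c in calls) / len(calls)


# Hypothetical values for one chip (not taken from the slides).
gapdh_ratio = three_to_five_ratio(signal_3prime=5200.0, signal_5prime=4000.0)
calls = ["P", "P", "A", "M", "P", "A", "P", "P"]

print(f"GAPDH 3'/5' ratio: {gapdh_ratio:.2f} (flag the chip if this exceeds 3)")
print(f"%P: {percent_present(calls):.1f}")
```

Running the same two functions on every replicate chip makes it easy to spot a chip whose %P differs markedly from its siblings.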
Microarray Data Process/Outline
1. Experimental Design
2. Image Analysis: scan to intensity measures (raw data)
3. Normalization: clean data
4. More low-level analysis: fold change, ANOVA, data filtering
5. Data mining: how to interpret > 6000 measures
   Databases
   Software
   Techniques: clustering, pattern recognition, etc.
   Comparing to prior studies, across platforms?
6. Validation
Experimental Design
A good microarray design has 4 elements
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of biological materials should minimize systematic bias
3. Simple and statistically sound arrangement that minimizes cost and gains maximal information
4. Compliance with MIAME (Minimum Information About a Microarray Experiment)
The goal of statistics is to find signals in a sea of noise
The goal of experimental design is to reduce the noise so signals can be found with as small a sample size as possible
Observational Study vs.
Designed Experiment
Observational study-
Investigator is a passive observer who measures variables of interest, but does not attempt to influence the responses
Designed Experiment-
Investigator intervenes in natural course of events
What type is our DMSO exp?
Experimental Replicates
Why? In any experimental system there is a certain amount of noise, so even 2 identical processes yield slightly different results
Sources?
To understand how much variation there is, it is necessary to repeat an experiment a number of independent times
Replicates allow us to use statistical tests to ascertain whether the differences we see are real
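As a rough illustration of why replicates matter, the sketch below computes a Welch t statistic for one gene from three control and three treated replicates. The log2 signal values are invented, and Welch's test is only one reasonable choice, not necessarily the test used in the course software. Turning the statistic into a p-value needs a t-distribution function from a statistics package, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two groups."""
    # Squared standard error of each group mean; needs >= 2 replicates per group.
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    t = (mean(group_b) - mean(group_a)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical log2 signals for one gene across independent replicates.
control = [7.9, 8.1, 8.0]
dmso = [9.0, 9.3, 9.1]

t_stat, dof = welch_t(control, dmso)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

With a single chip per condition the variance terms cannot be estimated at all, which is the practical argument for replication.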
Technical vs. Biological Replicates
As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise.
Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible.
Low level data analysis / pre-processing
Varying biological or cellular composition among sample types.
Differences in sample preparation, labeling or hybridization.
Non-specific cross-hybridization of target to probes.
These lead to systematic differences between individual arrays.
Raw Data Quality Control
Scaling
Normalization and filtering.
Image Analysis - Raw Data
From probe level signals to gene abundance estimates
The job of the expression summary algorithm is to take a set of Perfect Match (PM) and Mis-Match (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset.
To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array.
It is the .CEL files that are analyzed by the expression calling algorithm.
http://bioinformatics.picr.man.ac.uk/mbcf/example_ma.jsp
MAS 5.0 output files
For each transcript (gene) on the chip:
signal intensity
a present or absent call (presence call)
p-value (significance value) for making that call
Each gene associated with GenBank accession number (NCBI database)
How are transcripts determined to be
present or absent?
Probe pair (PM vs. MM) intensities
generate a detection p-value
assign Present, Absent, or Marginal call for transcript
Every probe pair in a probe SET has
a potential vote for presence call
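The "vote" idea can be sketched as follows. As Affymetrix describes the MAS 5.0 detection call, each probe pair contributes a discrimination score R = (PM - MM) / (PM + MM), and a one-sided Wilcoxon signed-rank test asks whether the scores sit above a small threshold tau. The defaults used here (tau = 0.015, cutoffs 0.04 and 0.06) are the commonly cited ones and should be treated as assumptions. The function names and intensities are hypothetical, and this sketch ignores details of the real software such as saturated probes.

```python
def signed_rank_p_value(values):
    """Exact one-sided Wilcoxon signed-rank p-value (alternative: median > 0)."""
    diffs = [v for v in values if v != 0]  # zeros carry no sign information
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n  # midranks times 2, so tied ranks stay integers
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            doubled_ranks[order[k]] = start + end + 2
        start = end + 1
    observed = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    # Null distribution: every pattern of signs is equally likely.
    counts = {0: 1}
    for r in doubled_ranks:
        updated = dict(counts)
        for total, c in counts.items():
            updated[total + r] = updated.get(total + r, 0) + c
        counts = updated
    as_extreme = sum(c for total, c in counts.items() if total >= observed)
    return as_extreme / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return ('P' | 'M' | 'A', p-value) for one probe set."""
    scores = [(p - m) / (p + m) - tau for p, m in zip(pm, mm)]
    p_value = signed_rank_p_value(scores)
    if p_value < alpha1:
        return "P", p_value
    if p_value < alpha2:
        return "M", p_value
    return "A", p_value


# Hypothetical intensities for an 11-pair probe set.
pm = [1200, 900, 1500, 1100, 1300, 1000, 1400, 950, 1250, 1050, 1150]
mm = [300, 400, 350, 500, 320, 450, 380, 420, 310, 390, 360]
print(detection_call(pm, mm))
```

Because every pair contributes only its rank and sign, a single very bright probe cannot force a Present call on its own.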
PM and MM Probes
The purpose of each MM probe is to provide a direct measure of background and stray signal (perhaps due to cross-hybridization) for its perfect-match partner. In most situations the signal from each probe pair is simply the difference PM - MM.
For some probe pairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
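A tiny numeric example makes the problem concrete. The intensities below are invented; the point is only that a plain PM - MM subtraction yields negative "signals" whenever MM exceeds PM, and these cannot be log-transformed.

```python
# Hypothetical intensities for four probe pairs of one probe set.
pm = [1200.0, 310.0, 980.0, 150.0]
mm = [400.0, 450.0, 300.0, 220.0]

# Naive per-pair signal: perfect match minus mismatch.
diffs = [p - m for p, m in zip(pm, mm)]

# Indices of pairs where the mismatch probe is brighter than the perfect match.
negative_pairs = [i for i, d in enumerate(diffs) if d < 0]

print("PM - MM:", diffs)
print("Pairs with MM > PM:", negative_pairs)
```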
Thank goodness for software!!!
MAS 5.0 does these calculations for you and writes the results to a .CHP file
Basic analysis can be done in MAS 5.0, but it won't handle replicates
Import MAS 5.0 (.CHP) data into other software: Genesifter, GCOS, SpotFire, and many others
Signal Intensity
Following these calculations, the MAS 5.0 algorithm now has a measure of the signal for each probe in a probeset.
Other algorithms (e.g., RMA, GCRMA, dCHIP, PLIER, and others) have been developed, largely by academic teams, to improve the precision and accuracy of this calculation.
In our experiment we will use RMA and GCRMA.
How do we want to analyze this data?
Pairwise analysis is most appropriate: Control vs. DMSO
List of genes that are upregulated or downregulated
Determine fold up or down cutoffs
What is significant? 1.5 fold up/down? 2 fold up/down? 10 fold up/down?
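To see how the choice of cutoff changes the gene list, here is a minimal sketch that classifies genes by log2 fold change. The gene names and signals are hypothetical, and the cutoff is a parameter precisely because the slide leaves the "right" value open.

```python
from math import log2


def classify_by_fold_change(signals, cutoff=2.0):
    """Label each gene 'up', 'down' or 'unchanged' for a given fold cutoff.

    `signals` maps gene name -> (control signal, treated signal).
    """
    threshold = log2(cutoff)
    labels = {}
    for gene, (control, treated) in signals.items():
        lfc = log2(treated / control)  # symmetric around 0 for up and down
        if lfc >= threshold:
            labels[gene] = "up"
        elif lfc <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical (control, DMSO) signals.
signals = {
    "GENE_A": (400.0, 1600.0),
    "GENE_B": (1000.0, 1100.0),
    "GENE_C": (900.0, 300.0),
}

for cutoff in (1.5, 2.0, 10.0):
    print(cutoff, classify_by_fold_change(signals, cutoff))
```

Working on the log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically (+1 and -1), which a raw ratio does not.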
Normalization - clean data
Normalizing data allows comparisons ACROSS different chips
Intensity of fluorescent markers might be different from one batch to the other
Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION
Why Normalize Data?
The experimental goal is to identify biological variation (expression changes between samples)
Technical variation can hide the real data
Unavoidable systematic bias should be recognized and corrected
Normalization is necessary to effectively make comparisons between chips, and sometimes within a single chip.
There are different methods of normalization; the assumptions about where variation exists will determine the normalization techniques used.
Always look at data before and after normalization
Spike in controls can help show which method may be best
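One of the simplest normalization methods is global scaling: multiply every signal on a chip by a constant so that all chips share the same central value. The sketch below scales to a common median. It illustrates the idea only and is not the exact procedure of MAS 5.0, RMA or GCRMA; the chip names and signals are hypothetical. Its underlying assumption is that most genes do not change between samples.

```python
from statistics import median


def scale_to_common_median(chips):
    """Rescale each chip so that its median equals the median of chip medians.

    `chips` maps chip name -> list of signal intensities.
    """
    medians = {name: median(values) for name, values in chips.items()}
    target = median(medians.values())
    return {
        name: [x * target / medians[name] for x in values]
        for name, values in chips.items()
    }


# Hypothetical raw signals: the second chip is uniformly 1.5x brighter.
chips = {
    "control_1": [100.0, 200.0, 400.0, 800.0, 1600.0],
    "dmso_1": [150.0, 300.0, 600.0, 1200.0, 2400.0],
}

normalized = scale_to_common_median(chips)
for name, values in normalized.items():
    print(name, [round(v, 1) for v in values])
```

Printing the values before and after, as the slide advises, shows that the uniform brightness difference disappears while the ratios within each chip are untouched.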
Caveat
There is NO standard way to analyze microarray data
Still figuring out how to get the best answers from microarray experiments
Best to combine knowledge of biology, statistics, and computers to get answers
Venn Diagrams
[Figure: Venn diagrams with sets labeled MAS 5.0, RMA, and GCRMA]
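Assuming the Venn diagrams compare the gene lists reported by each algorithm, the regions of such a diagram are plain set operations. The gene identifiers below are placeholders, not results from the course data.

```python
# Hypothetical lists of changed genes reported by each algorithm.
mas5 = {"g1", "g2", "g3", "g4"}
rma = {"g2", "g3", "g5"}
gcrma = {"g3", "g4", "g5", "g6"}

# Center of the diagram: genes called by all three methods.
all_three = mas5 & rma & gcrma

# Outer region of one circle: genes called by MAS 5.0 only.
mas5_only = mas5 - rma - gcrma

# Any overlap region: genes called by at least two methods.
at_least_two = (mas5 & rma) | (mas5 & gcrma) | (rma & gcrma)

print("All three:", sorted(all_three))
print("MAS 5.0 only:", sorted(mas5_only))
print("At least two:", sorted(at_least_two))
```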
Data processing is completed: now what?
Fold change, ANOVA, Data filtering
Where are we now?
Ran analysis, output is a GENELIST
List indicates what genes are up or down regulated
p values for t-test
Graphs of signal levels
Absolute numbers not as important here as the trends you see
Now what????
What is the first set of genes on our chips that will be filtered out?
Follow the links
Click on a gene
Find links to other databases
Follow links to discover what the protein does
Now the fun part begins.
Back to Biology
Do the changes you see in gene expression make sense BIOLOGICALLY?
If they don't make sense, can you hypothesize as to why those genes might be changing?
Leads to many, many more experiments
The Gene Ontologies
A Common Language for Annotation of Genes from Yeast, Flies and Mice
and Plants and Worms
and Humans
and anything else!
Gene Ontology Objectives
GO represents concepts used to classify specific parts of our biological knowledge:
Biological Process
Molecular Function
Cellular Component
GO develops a common language applicable to any organism
GO terms can be used to annotate gene products from any species, allowing comparison of information across species
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.)
- Wired Magazine, May 1996
The 3 Gene Ontologies
Molecular Function = elemental activity/task
the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
Example:
Gene Product = hammer

Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown's juggling object    Entertainment
Biological Examples
Molecular Function, Biological Process, Cellular Component
Validation
Not enough to just do microarrays
Usually validate microarray results
via some other technique
RT-PCR
TaqMan
Northern analysis
Protein level analysis
No technique is perfect
Dynamic Nature of Yeast Genome
eORF = essential
kORF = known
hORF = homology identified
shORF = short
tORF = transposon identified
qORF = questionable
dORF = disabled
The first published sequence claimed 6274 genes, a number that has been revised many times. Why?
The Affy detection oligonucleotide sequences are frozen at the time of synthesis. How does this impact downstream data analysis?
[Figure: values shown on the slide: 6603, 4373, 1410, 820]
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
definition_reference: PMID:9561267
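The four fields of the term record above map naturally onto a small data structure. The sketch below is one hypothetical way to hold such a record in code; the class name and the validation step are illustrative choices, not part of any GO software. The only format rule relied on is that a GO identifier is "GO:" followed by seven digits.

```python
import re
from dataclasses import dataclass

# GO identifiers consist of the prefix "GO:" and a zero-padded 7-digit number.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GOTerm:
    """One Gene Ontology term record, mirroring the fields on the slide."""
    goid: str
    term: str
    definition: str
    definition_reference: str

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.goid):
            raise ValueError(f"not a valid GO identifier: {self.goid!r}")


mating_cascade = GOTerm(
    goid="GO:0007244",
    term="MAPKKK cascade (mating sensu Saccharomyces)",
    definition=(
        "MAPKKK cascade involved in transduction of mating pheromone "
        "signal, as described in Saccharomyces"
    ),
    definition_reference="PMID:9561267",
)

print(mating_cascade.goid, "->", mating_cascade.term)
```

Keeping the stable identifier separate from the human-readable term name means annotations stay valid even if the name is later reworded.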
SGD
SGD public microarray data sets available for public query
Homework
1. Go to http://www.yeastgenome.org/ and find 3 candidate genes of known function and one of undefined function that you might predict to be altered by DMSO treatment.
2. What GO biological processes and molecular functions are associated with your candidate genes?
3. Where in the cell does the protein reside?
4. What other proteins are known or inferred to interact with yours? How was this interaction determined? Is this a genetic or physical interaction?
5. Find the expression of at least one of your known genes in another publicly deposited microarray data set.
   1. What is the name of the data set, and how did you find it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3 reasons why the observed transcript level difference may not be confirmed through a second technology like RT-qPCR?
Suggested Reading