Microarray Analysis with a Small Number of Replicates
By Kung-Hua Chang & Dhondup Pemba
Mentors: Cecilie Boysen, Ph.D. & Jim Breaux, Ph.D.
Southern California Bioinformatics Institute
Summer 2005
Funded by NSF/NIH
• Signal detection: fluorescence detection of hybridization between RNA target and oligonucleotide probe.
Functional Analysis (identify affected processes and pathways)
Statistical Analysis with a Small Number of Replicates
Overview
Overall objective: Perform end-to-end analysis on a client’s microarray data set (from raw image to pathway analysis).
Problem: The data set contained a small number of replicates.
Problem with a small number of replicates
A small number of replicates yields unreliable estimates of gene variances.
FOR MORE INFO...
Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays (Jain et al.)
With seven replicates, we are more confident that gene 1 is upregulated.
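The effect of replicate count on variance estimates can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated Gaussian values, not microarray measurements, and the function name, trial count, and seed are hypothetical choices made for the example.

```python
import random
import statistics


def variance_estimate_spread(n_replicates, n_trials=2000, true_sd=1.0, seed=0):
    """Return the standard deviation of the sample variance across simulated genes.

    Each simulated gene has the same true variance, so any spread in the
    estimates comes only from having few replicates.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        values = [rng.gauss(0.0, true_sd) for _ in range(n_replicates)]
        estimates.append(statistics.variance(values))
    return statistics.stdev(estimates)


spread_3 = variance_estimate_spread(3)
spread_7 = variance_estimate_spread(7)
print(f"spread of variance estimates, 3 replicates: {spread_3:.2f}")
print(f"spread of variance estimates, 7 replicates: {spread_7:.2f}")
```

With three replicates the per-gene variance estimates scatter much more widely around the true value than with seven, which is why a gene-by-gene test is hard to trust at n = 3.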
Approach to dealing with a small number of replicates
Analyze a larger data set that has a good number of replicates (n = 8x8).
– Assume this is the “truth”
Analyze a randomly selected subset of this data set (n = 3x3) using three different algorithms.
Compare output from the 8x8 analysis to the 3x3 analysis.
– Decide how to analyze the client’s data set based on the results
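The random 3x3 subset could be drawn along these lines. This is a minimal sketch: the array labels, the function name, and the fixed seed are assumptions for illustration, and the slides do not say how the subset was actually selected.

```python
import random


def subsample_replicates(control_ids, treated_ids, k=3, seed=None):
    """Randomly pick k replicates from each group without replacement."""
    rng = random.Random(seed)
    return rng.sample(control_ids, k), rng.sample(treated_ids, k)


# Hypothetical labels for the 8 control and 8 treated arrays.
control = [f"control_{i}" for i in range(1, 9)]
treated = [f"treated_{i}" for i in range(1, 9)]

sub_control, sub_treated = subsample_replicates(control, treated, k=3, seed=42)
print(sub_control, sub_treated)
```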
SAM
Each gene receives a score based on the difference in average gene expression relative to the standard deviation of the repeated measurements.
Genes with scores greater than a threshold are considered significant.
This threshold is determined by the false discovery rate the user desires.
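As a rough illustration of the score described above, the sketch below follows the statistic given in Tusher et al.: the difference of group means divided by a pooled standard error plus a small constant s0. The function name, the sample values, and the s0 values are assumptions for the example. The paper chooses s0 from the data and derives the threshold from permutations; neither step is shown here.

```python
import math


def sam_score(group_a, group_b, s0=0.0):
    """SAM-style relative difference for one gene.

    group_a, group_b: expression values from the two conditions.
    s0: small positive constant that damps scores for low-variance genes
        (hypothetical value here; SAM estimates it from the data).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
    pooled_se = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (mean_b - mean_a) / (pooled_se + s0)


control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
print(sam_score(control, treated))          # plain pooled t-like score
print(sam_score(control, treated, s0=1.0))  # damped score
```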
FOR MORE INFO...
Significance analysis of microarrays applied to the ionizing radiation response (Tusher et al.)
J-Score
Each gene receives a score based on average fold-change in gene expression relative to the standard deviation of the repeated measurements.
Cut-off for selection of “significant” genes is arbitrary.
Cyber-T (Baldi & Long)
Cyber-T ‘Regularized t-test’
“Assumes genes of similar expression levels have similar measurement errors.
The variance of any single gene can be estimated from the variance from a number of genes of similar expression level.
The variance of any gene within any given treatment can be estimated by the weighted average of a prior estimate of variance for that gene.”
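The weighted average in the last point can be sketched as follows, using the regularized variance form from the Baldi & Long framework: the prior (background) variance and the gene's own sample variance are combined, with the prior weighted by a pseudo-count. The function and parameter names and the numeric values are assumptions for illustration. In Cyber-T the background variance comes from genes of similar expression level, which is not computed here.

```python
def regularized_variance(sample_var, n, prior_var, prior_weight):
    """Combine a gene's sample variance with a background (prior) variance.

    sample_var:   variance of the gene's n replicate measurements
    prior_var:    background variance from genes of similar expression level
    prior_weight: pseudo-count giving the strength of the prior (assumed value)
    """
    return (prior_weight * prior_var + (n - 1) * sample_var) / (prior_weight + n - 2)


# A gene with 3 replicates and a suspiciously small sample variance
# is pulled toward the background variance.
print(regularized_variance(sample_var=0.1, n=3, prior_var=0.5, prior_weight=10))
```

With only three replicates the prior dominates, so a gene cannot look significant just because its three measurements happened to agree closely.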
FOR MORE INFO...
Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework (Long et al.).
Results: Comparison between SAM 8x8 and 3x3 methods
At 1% False Discovery Rate (FDR), SAM 8x8 picked up 762 significant genes (estimated number of false significant genes = 8).
Agreement between SAM 8x8 and the top 1000 genes from the 3x3 methods:
                                    SAM 8x8   SAM 3x3      J-Score 3x3   Cyber-T 3x3
Total genes selected                762       1000         1000          1000
Total match to SAM 8x8 (FDR = 1%)   ---       269          356           303
Estimated false positives           8         731 to 739   644 to 652    697 to 705
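The false-positive ranges in the table follow from simple arithmetic. The sketch below reproduces them under the reading that the lower bound is every selected gene that did not match SAM 8x8, and the upper bound adds the 8 genes that SAM 8x8 itself is estimated to have called falsely. The function name is hypothetical.

```python
def false_positive_range(n_selected, n_matched, n_false_in_truth):
    """Range of estimated false positives for one 3x3 method.

    low:  selected genes that did not match the 8x8 "truth" list
    high: low plus the genes the "truth" list may itself have wrong
    """
    low = n_selected - n_matched
    high = low + n_false_in_truth
    return low, high


matches = {"SAM 3x3": 269, "J-Score 3x3": 356, "Cyber-T 3x3": 303}
for method, n_matched in matches.items():
    low, high = false_positive_range(1000, n_matched, 8)
    print(f"{method}: {low} to {high}")
```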
[Venn diagram: overlap among the J-Score, Cyber-T, and SAM 3x3 gene lists. Region counts shown: 48, 53, 174, 31, 17, 81, 30.]
Results: Comparison between 3x3 methods
Union of all three methods = 433 unique genes
Agreement between any two methods:
                      #Common   %Common
SAM & J-Score         222       62.5
SAM & Cyber-T         191       63.0
J-Score & Cyber-T     255       71.0
These findings are consistent with a previous study by a group at NIH (Hosack et al.):
– Found that agreement between the various methods tested ranged from 7% to 60%.
Possible Approaches for Final Analysis
Method 1: Final set of significant genes is derived from the method that had the most overlap with SAM 8x8 (J-Score).
Final result:
– 1000 total significant genes
– At most 356 true positives
– At most 652 false positives
Pro:
– Decent number of true positives
Con:
– Large number of false positives
– Might be missing important genes found by the other two methods
Possible Approaches for Final Analysis
Method 2: Final set of significant genes is the intersection of the three methods.
Final result:
– 174 total significant genes
– At most 174 true positives
– At most 8 false positives
Pro:
– Lowest number of false positives
Con:
– Lowest number of true positives
Possible Approaches for Final Analysis
Method 3: Final set of significant genes is the union of the three methods.
Final result:
– 1631 total significant genes
– At most 433 true positives
– At most 1206 false positives
Pro:
– Highest number of true positives
Con:
– Highest number of false positives
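Methods 2 and 3 are set operations on the three gene lists. A toy sketch with made-up gene IDs (not genes from the study) shows the trade-off: the intersection is small and conservative, the union is large and inclusive.

```python
# Hypothetical gene lists from the three 3x3 methods.
sam = {"g1", "g2", "g3", "g5"}
j_score = {"g1", "g2", "g4", "g6"}
cyber_t = {"g1", "g3", "g4", "g7"}

# Method 2: genes called significant by all three methods.
intersection = sam & j_score & cyber_t

# Method 3: genes called significant by at least one method.
union = sam | j_score | cyber_t

print(sorted(intersection))
print(sorted(union))
```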
Final Approach
Return the largest number of true positives to the client (Method 3).
To deal with the large number of potential false positives in the results, we rank each gene based on its ranking from the Cyber-T, J-Score, and SAM methods.
– For example, if “Gene 02” is ranked number 2 in Cyber-T, number 3 in J-Score, and number 4 in SAM, then the overall ranking is (2 + 3 + 4) / 3 = 3
– Higher overall ranking (smaller average rank) = more likely to be a true positive
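The rank averaging above can be sketched as follows. The data layout (one dictionary of gene-to-rank per method) and the function name are assumptions, and the ranks for "Gene 01" are made up. "Gene 02" uses the ranks from the slide's example.

```python
def average_rank(rankings):
    """Average each gene's rank across methods and sort best-first.

    rankings: dict mapping method name -> dict mapping gene -> rank
              (rank 1 is the most significant gene for that method).
    Only genes ranked by every method are combined.
    """
    genes = set.intersection(*(set(ranks) for ranks in rankings.values()))
    combined = {
        gene: sum(ranks[gene] for ranks in rankings.values()) / len(rankings)
        for gene in genes
    }
    # Smaller average rank means the gene sits higher in the final list.
    return sorted(combined.items(), key=lambda item: item[1])


rankings = {
    "Cyber-T": {"Gene 01": 1, "Gene 02": 2},
    "J-Score": {"Gene 01": 1, "Gene 02": 3},
    "SAM": {"Gene 01": 2, "Gene 02": 4},
}
overall = average_rank(rankings)
print(overall)
```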
Example Output of Our Approach
• Mapping to biological processes.
-- EASE, the Expression Analysis Systematic Explorer, from the National Institute of Allergy and Infectious Diseases at the National Institutes of Health.
• Mapping to pathways.
-- PathwayAssist software from Ariadne Genomics.
Mapping to biological processes
• The lists of up- and down-regulated genes were entered into EASE.
• The lower the EASE score, the more highly ranked the process.
• Example of the top 14 processes, locations, and functions found from our significant genes.
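To show why a lower EASE score means a stronger result, here is a hedged sketch of an over-representation test. Hosack et al. describe the EASE score as a conservative variant of the one-tailed Fisher exact probability, computed after removing one gene in the category from the list. The code below follows that description with a hypergeometric tail; the function names and the counts are assumptions, and the EASE software's exact bookkeeping may differ.

```python
from math import comb


def fisher_upper_tail(list_hits, list_total, pop_hits, pop_total):
    """P(at least list_hits category genes in a random list of list_total genes)."""
    max_hits = min(list_total, pop_hits)
    total_ways = comb(pop_total, list_total)
    favorable = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, max_hits + 1)
    )
    return favorable / total_ways


def ease_like_score(list_hits, list_total, pop_hits, pop_total):
    """Fisher probability after dropping one category gene from the list."""
    if list_hits < 1:
        return 1.0
    return fisher_upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)


# Hypothetical counts: 3 of 30 significant genes fall in a category
# that holds 40 of the 3000 genes on the array.
p_fisher = fisher_upper_tail(3, 30, 40, 3000)
p_ease = ease_like_score(3, 30, 40, 3000)
print(p_fisher, p_ease)
```

The penalized score is never smaller than the plain Fisher probability, so categories supported by only a handful of genes are ranked lower.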
Mapping to pathways
• Genes 1, 2, and 3 are significant up- or down-regulated genes by our combination method.
• Investigation of gene 1 reveals that genes 2 and 3 are involved in gene 1’s pathway.
[Pathway diagram showing Gene 1, Gene 2, and Gene 3]
Conclusion
Three algorithms for selecting differentially expressed genes produced different lists of genes with ~60% to 70% agreement.
Taking the union of the results from the three algorithms yielded the most true positives for our client.
Biological processes and pathways found through functional analysis correspond to what we expected based on the samples studied.
– Helps to make microarray results more believable.
ViaLogy has individual scripts that are used to test the quality of VMAxS output.
Current implementation requires working knowledge of R scripting.
Project: implement a user-friendly GUI program to execute multiple QC tests.
Optional window pops up if default parameters are not desired.
Acknowledgements
SoCalBSI
Dr. Sandra Sharp
Dr. Wendie Johnston
Dr. Jamil Momand
Dr. Nancy Warter-Perez
Other SoCalBSI Staff and Faculty
SoCalBSI 2005 Participants
Lien Chung (SoCalBSI Participant 2004)
ViaLogy
Dr. Cecilie Boysen
Dr. Jim Breaux
Other ViaLogy Employees
References
Hosack DA, Dennis GJ, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4:R70.
Cope LM, Irizarry RA, Jaffee HA, Wu J, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20:323–331.
Long AD, Mangalam HJ, Chan BYP, Tolleri L, Hatfield GW, Baldi P: Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. J Biol Chem 2001, 276(23):19937–19944.
Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003, 19(15):1945–1951.
data normalization and statistical validation. Comparative and Functional Genomics (Comp Funct Genom) 2003, 4:442–446. Conference review.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98:5116–5121.