Analysis of High-throughput Gene Expression Profiling
Why Measure Gene Expression?
1. Determines which genes are induced/repressed in response to a developmental phase or to an environmental change.
2. Sets of genes whose expression rises and falls under the same condition are likely to have a related function.
3. Features such as a common regulatory motif can be detected within co-expressed genes.
4. A pattern of gene expression may be used as an indicator of abnormal cellular regulation.
• A useful tool for cancer diagnosis
Why Measure Gene Expression on a Large Scale?
Traditional vs. High-throughput Approaches
Techniques Used to Detect Gene Expression Level
• Microarray (single or dual channel)
• SAGE
• EST/cDNA library
• Northern Blots
• Subtractive hybridisation
• Differential hybridisation
• Representational difference analysis (RDA)
• DNA/RNA Fingerprinting (RAP-PCR)
• Differential Display (DD-PCR)
• aCGH: array CGH (DNA level)
High-throughput
Basic Information of Microarray, SAGE and cDNA Library
(DNA) Microarray
1. Developed around 1987.
2. Employs methods previously exploited in the immunoassay context – specific binding and marking techniques.
3. Two types of probes:
Format I: probe cDNA (500~5,000 bases long) is immobilized on a solid surface such as glass; widely considered to have been developed at Stanford University; traditionally called DNA microarrays.
Format II: an array of oligonucleotide (20~80-mer oligos) probes is synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization; developed at Affymetrix, Inc. Many companies are manufacturing oligonucleotide-based chips using alternative in situ synthesis or deposition technologies. Historically called DNA chips.
Microarray
• Single Channel: sub-type classification
• Dual Channel: differential expression gene screening
• Tissue microarray
• Protein microarray
• ……
Array CGH
• Detects DNA copy number variation using a microarray approach
• A hot topic in recent research, especially in cancer research
Microarray Analysis
• gene discovery
• pattern discovery
• inferences about biological processes
• classification of biological processes
Which genes are up-regulated, down-regulated, co-regulated, not-regulated?
SAGE
• Experimental technique designed to obtain a quantitative measure of gene expression.
• ~10-20 base “tags” are produced (immediately adjacent to the 3’ end of the 3’ most NlaIII restriction site).
• The SAGE technique measures not the expression level of a gene, but quantifies a "tag" which represents the transcription product of a gene.
SAGE
Tags are isolated and concatemerized.
Relative expression levels can be compared between cells in different states.
SAGEmap (http://cgap.nci.nih.gov)
SAGE: comparing two relational libraries
EST library (UniGene)
Gene expression info from the UniGene library
An Example of In-house EST Library Analysis
The Algorithms and Challenges of High-throughput Gene Expression Analysis
Seeing is believing?
No, need to correct errors.
SAGE:
• A typical experiment requires ~30,000 gene expression comparisons between a normal cell and a diseased cell.
• The results depend on the size and reliability of the SAGE libraries.
• Statistical measures are used to filter candidate genes and reduce the dimensionality of the data, but tuning these measures until a good set is found is tedious and time-consuming.
SAGE
• TPM: a simple normalization method
  TPM = Count * 1,000,000 / TotalCount
• Bayesian approach http://cancerres.aacrjournals.org/cgi/content/full/59/21/5403
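The TPM formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming each SAGE library is held as a dictionary of tag counts; the function name `tags_per_million` and the tag labels are hypothetical and not part of any SAGE tool.

```python
def tags_per_million(tag_counts):
    """Scale raw tag counts so that every library sums to one million."""
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    # TPM = Count * 1,000,000 / TotalCount
    return {tag: count * 1_000_000 / total for tag, count in tag_counts.items()}


# Hypothetical libraries of different sizes (placeholder tag labels).
normal = {"TAG_A": 30, "TAG_B": 10, "TAG_C": 60}     # 100 tags in total
tumour = {"TAG_A": 150, "TAG_B": 50, "TAG_C": 300}   # 500 tags in total

normal_tpm = tags_per_million(normal)
tumour_tpm = tags_per_million(tumour)
```

After scaling, the two libraries become directly comparable even though one was sequenced five times deeper: the raw counts differ, but the TPM values of each tag agree.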
Microarray: Sources of errors
• systematic
• random
[Figure: log signal intensity (y-axis) vs. log RNA abundance (x-axis)]
Sources of Errors (Cont.)
• Printing and/or tip problems
• Labeling and dye effects (differing amounts of RNA labeled between the 2 channels)
• Differences in the power of the two lasers (or other scanner problems)
• Difference in DNA concentration on arrays (plate effects)
• Spatial biases in ratios across the surface of the microarray due to uneven hybridization
• cDNA array cannot distinguish alternatively spliced forms
Errors that cannot be corrected by statistics
• Competitive hybridization of different targets on the chip
• Failure to distinguish different splice forms
• Misinterpretation of time course data when there are too few time points
• Misinterpretation of relative intensity
Does clustered time course really mean co-expression?
Picture taken from http://genomics.stanford.edu/yeast/additional_figures_link.html
Yes, you can study a known system (such as the cell cycle) this way; but what about unknown systems?
Normalization by iterative linear regression
1. Fit a line (y = mx + b) to the data set.
2. Set aside outliers (residuals > 2 x s.e.).
3. Repeat until r^2 changes by < 0.001.
4. Then apply the slope and intercept to the original data set.
D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
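The iterative procedure above could be sketched roughly as follows. This is an illustration only, not the authors' implementation. It assumes that x and y are the log intensities of the two channels, that "s.e." means the residual standard error of the fit, and that "applying" the slope and intercept means rescaling y so that the fitted line becomes y = x. All function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b, r_squared, s_e)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r2 = 1.0 - sse / syy if syy > 0 else 1.0
    s_e = math.sqrt(sse / (n - 2)) if n > 2 else 0.0  # residual standard error
    return m, b, r2, s_e


def iterative_regression(xs, ys, k=2.0, tol=0.001, max_iter=50):
    """Refit after setting aside points whose residual exceeds k * s.e."""
    kept = list(zip(xs, ys))
    m, b, r2, s_e = fit_line(*zip(*kept))
    for _ in range(max_iter):
        inliers = [(x, y) for x, y in kept if abs(y - (m * x + b)) <= k * s_e]
        if len(inliers) < 3 or len(inliers) == len(kept):
            break  # nothing left to remove, or too few points to refit
        kept = inliers
        m, b, new_r2, s_e = fit_line(*zip(*kept))
        converged = abs(new_r2 - r2) < tol
        r2 = new_r2
        if converged:
            break
    return m, b


def normalize(ys, m, b):
    """Assumed normalization step: map y back onto the scale of x."""
    return [(y - b) / m for y in ys]


# Hypothetical data: y = 0.8*x + 0.5 with small noise and one gross outlier.
xs = [float(i) for i in range(20)]
ys = [0.8 * x + 0.5 + (0.05 if i % 2 == 0 else -0.05) for i, x in enumerate(xs)]
ys[10] += 5.0
slope, intercept = iterative_regression(xs, ys)
normalized = normalize(ys, slope, intercept)
```

Because the outlier is set aside before the final fit, the recovered slope and intercept stay close to the underlying trend, while the normalization is still applied to every original data point, including the outlier.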
Normalization (Curvilinear)
G Tseng et al., NAR 2001
[Figure: ratio {log2 (Cy5 / Cy3)} (y-axis) vs. average signal {log2 (Cy3 + Cy5)/2} (x-axis), with Loess function fit line]
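The idea behind curvilinear normalization can be sketched as follows: fit a smooth curve to the log ratio as a function of average signal, then subtract it. This is a simplified local linear fit with tricube weights, written in plain Python for illustration. It is not the procedure of Tseng et al. and omits the robustness iterations of a full loess. It assumes the average signal is (log2 Cy3 + log2 Cy5)/2; the function names and the simulated dye bias are hypothetical.

```python
import math


def ma_transform(cy3, cy5):
    """Return (A, M): average log2 signal and log2 ratio for each spot."""
    a_vals = [(math.log2(g) + math.log2(r)) / 2 for g, r in zip(cy3, cy5)]
    m_vals = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    return a_vals, m_vals


def loess_fit(a_vals, m_vals, span=0.3):
    """Local linear fit of M on A using tricube weights."""
    n = len(a_vals)
    k = min(n, max(3, math.ceil(span * n)))  # neighbours used per local fit
    fitted = []
    for a0 in a_vals:
        dists = sorted(abs(a - a0) for a in a_vals)
        h = dists[k - 1] or 1e-12            # bandwidth: k-th nearest distance
        w = [(1 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in a_vals]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_vals)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_vals)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_vals))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_vals, m_vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted


def loess_normalize(cy3, cy5, span=0.3):
    """Subtract the fitted intensity-dependent trend from the log ratios."""
    a_vals, m_vals = ma_transform(cy3, cy5)
    trend = loess_fit(a_vals, m_vals, span)
    return a_vals, [m - t for m, t in zip(m_vals, trend)]


# Simulated spots with a smooth, intensity-dependent dye bias (hypothetical).
a_true = [6 + 0.2 * i for i in range(41)]
bias = [0.1 * (a - 10) ** 2 - 0.5 for a in a_true]
cy3 = [2 ** (a - m / 2) for a, m in zip(a_true, bias)]
cy5 = [2 ** (a + m / 2) for a, m in zip(a_true, bias)]

_, raw_m = ma_transform(cy3, cy5)
_, norm_m = loess_normalize(cy3, cy5)
```

In this simulation no gene is truly differentially expressed, so the normalized log ratios should sit close to zero across the whole intensity range, whereas the raw ratios drift with signal intensity.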
After Normalization ……
• Differentially Expressed (DE) gene screening
  – T-test
  – T-statistics
  – SVM
• Clustering
  – Hierarchical
  – SOM
  – K-means
• Network (Pathway) analysis
  – BioCarta, KEGG, GO databases
  – Bayesian network learning
  – Topology
  – …
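As an illustration of the T-statistics item in the DE screening step, the sketch below computes a per-gene Welch t-statistic between two groups and ranks genes by its absolute value. The choice of the Welch (unequal variance) form is an assumption, since the slide does not say which variant is meant. The gene labels and the log2 expression values are made up, and no P values are computed here.

```python
import math
from statistics import mean, variance


def welch_t(control, treated):
    """Welch t-statistic; positive when the treated group is higher."""
    diff = mean(treated) - mean(control)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treated) / len(treated))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


# Hypothetical normalized log2 expression values: (normal, tumour) per gene.
expression = {
    "geneA": ([5.0, 5.2, 4.9, 5.1], [8.0, 7.8, 8.1, 8.2]),
    "geneB": ([6.0, 6.3, 5.8, 6.1], [6.1, 5.9, 6.2, 6.0]),
    "geneC": ([3.0, 3.5, 2.8, 3.2], [3.9, 4.4, 3.6, 4.1]),
}

t_stats = {gene: welch_t(normal, tumour)
           for gene, (normal, tumour) in expression.items()}

# Rank genes from most to least differentially expressed.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
```

With thousands of genes tested at once, a ranking like this is only a first pass; the multiple-testing problem still has to be handled before calling any gene significant.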
Bioinformatics challenges
1. data management
2. utilizing data from multiple experiments
3. utilizing data from multiple groups
   * with different technologies
   * with only processed data available
Integrated Bioinformatics Analysis of Gene Expression Profiling
Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression
Daniel R. Rhodes et al. PNAS, 2004(101), 9309-9314
T-test
Q values (estimated false discovery rates) were calculated as
Q = (P * n) / i
where P is the P value, n is the total number of genes, and i is the sorted rank of the P value.
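A rank-based Q value of the form Q = P * n / i, using the symbols defined above, can be computed as in this minimal sketch. The function name is hypothetical, and capping Q at 1 is an added convenience that the text does not state.

```python
def q_values(p_values):
    """Estimate Q = P * n / i, where i is the ascending rank of each P value."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])
    q = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # Cap at 1.0 (an assumption; a rate above 1 is not meaningful).
        q[idx] = min(1.0, p_values[idx] * n / rank)
    return q


# Hypothetical P values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
q = q_values(p)
```

Note that this direct formula does not force Q to increase with rank: in the example, the gene with P = 0.04 (rank 3) receives a smaller Q than the gene with P = 0.03 (rank 2). Standard false discovery rate procedures usually add a monotonicity adjustment for that reason.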
Cont. Meta-Profiling.
The purpose of meta-profiling is to address the hypothesis that a selected set of differential expression signatures shares a significant intersection of genes (a meta-signature), thus inferring a biological relatedness.
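The intersection step of meta-profiling can be sketched by counting, for each gene, how many signatures it appears in and keeping the genes that reach a chosen threshold. This sketch covers the counting only. It does not assess whether the overlap is larger than expected by chance, which the hypothesis above requires. The study names, gene labels and threshold are hypothetical.

```python
from collections import Counter


def meta_signature(signatures, min_count):
    """Return genes present in at least `min_count` signatures, with counts."""
    counts = Counter(gene for genes in signatures.values() for gene in set(genes))
    return {gene: c for gene, c in counts.items() if c >= min_count}


# Hypothetical differential expression signatures from three studies.
signatures = {
    "study1": {"G1", "G2", "G3"},
    "study2": {"G2", "G3", "G4"},
    "study3": {"G3", "G5"},
}

shared = meta_signature(signatures, min_count=2)
```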
67 genes were screened by meta-analysis
Integrated Cancer Gene Expression Map
7 genes were discovered by the system
THANX!!