Statistical Analysis of cDNA Microarray Data: Challenges and Solutions
Toni Reverter
CSIRO – Livestock Industries
AAHL Seminar - 12 Dec. 2002
Challenges
Time Dependent | Data Dependent | Human Dependent
Topics: Chronology, Paradigm, Skill Integration, Distribution, Source, Size
Logical: 1800s – DATA, 30-60s – METHODS, 50-70s – SOFTWARE, 1980s – COMPUTER
cDNA
Quantitative: Computer Sci., Statisticians, Mathematicians, …
Non-Q: Biochemists, Physiologists, Pathologists, …
Human Dependent: Historical, Excitement, Balance, Interdisciplinary
EGG BANANA
“banana omelette”
Human Dependent
Challenges
Historical
• Traditionally: Statistics grew alongside Agriculture
“Introduction to Statistical Analysis”
• Nowadays: Statistics alongside (Bio)Technology
• Law of Large Numbers
• Central Limit Theorem
• Pythagoras Theorem
SST = SSM + SSE
[Figure: triangle labelled a, b and d]
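The link between the Pythagoras Theorem and SST = SSM + SSE can be spelled out as follows. This is a sketch under the assumption that SST, SSM and SSE are the total, model and error sums of squares of a least-squares fit that includes an intercept; the slide itself does not define them.

```latex
\begin{align*}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\mathrm{SSM} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
y_i - \bar{y} &= (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \\
\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) &= 0
\quad\Longrightarrow\quad
\mathrm{SST} = \mathrm{SSM} + \mathrm{SSE}
\end{align*}
```

The cross-product vanishes because least-squares residuals are orthogonal to the fitted values, so the two components form the legs of a right triangle and the total forms the hypotenuse.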
Hysterical
Human Dependent
Challenges
Excitement (source of)
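Eg. Always log spot intensities and ratios
T. Speed, “Hints and Prejudices”
• Biochemist: My software does it, therefore it’s great!
• Statistician: Well, I need further evidence to be convinced
\[
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln(x), & \lambda = 0
\end{cases}
\]
\[
l(\lambda) = -\frac{n}{2} \ln\left[ \frac{1}{n} \sum_{j=1}^{n} \left( x_j^{(\lambda)} - \bar{x}^{(\lambda)} \right)^2 \right] + (\lambda - 1) \sum_{j=1}^{n} \ln(x_j)
\]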
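The formulas on this slide appear to be the power (Box-Cox) transformation and its log-likelihood, which gives the statistician's "further evidence" for or against taking logs: if the likelihood peaks near lambda = 0, the log is supported. A minimal sketch in the same shell/awk style as the rest of the talk follows; the function name `boxcox_loglik` and the toy data are illustrative assumptions, not part of the original.

```shell
# Sketch: log-likelihood l(lambda) of the power transformation.
# boxcox_loglik is a hypothetical helper name, not from the seminar.
boxcox_loglik() {
  # $1: lambda; remaining arguments: strictly positive data values
  lambda=$1; shift
  printf '%s\n' "$@" |
  awk -v lam="$lambda" '
    { x[NR] = $1; sumlog += log($1) }
    END {
      n = NR
      # transform each value: ln(x) if lambda = 0, (x^lambda - 1)/lambda otherwise
      for (j = 1; j <= n; j++) {
        y[j] = (lam == 0) ? log(x[j]) : (x[j]^lam - 1) / lam
        mean += y[j] / n
      }
      # sum of squared deviations of the transformed values
      for (j = 1; j <= n; j++) ss += (y[j] - mean)^2
      printf "%.4f\n", -n / 2 * log(ss / n) + (lam - 1) * sumlog
    }'
}

# Scan a small grid of lambda values on toy intensities (made-up numbers)
for lam in -1 -0.5 0 0.5 1; do
  echo "lambda = $lam  l(lambda) = $(boxcox_loglik "$lam" 120 340 95 2100 780 15000)"
done
```

The value of lambda with the largest l(lambda) is the transformation best supported by the data.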
Eg. Keren Byrne’s Data
Human Dependent
Challenges
Balance
•Too many Statisticians:
Evidence: It takes 1 ship 10 days to cross the ocean
Question: How many days does it take for 10 ships to cross the ocean?
Evidence: It takes 1 builder 10 days to build a wall
Question: How many days does it take for 10 builders to build a wall?
Human Dependent
Challenges
Balance
•Too many Statisticians:
PHD SCHOLARSHIP
Statistical Science Program
MATHEMATICAL SCIENCES INSTITUTE
THE AUSTRALIAN NATIONAL UNIVERSITY
Stipend $22,771 (2002 rate, indexed annually, tax free)
A PhD Scholarship (APAI) is being offered by the Mathematical Sciences Institute at The ANU. An ARC Linkage Grant held by Professors Peter Hall (ANU) and Don Poskitt (Monash University), in conjunction with BAE Systems, Melbourne, will fund the scholarship.
The research problem is in the area of stochastic control applied to ship motion, and involves the development and implementation of both parametric and nonparametric methods. The successful applicant will have a strong interest in statistical methodology, computational techniques, theoretical analysis, and the development of statistical research problems.
Human Dependent
Challenges
Balance
•Too many Biochemists:
Overall:
              Treated? No   Treated? Yes
Died? No          100           150
Died? Yes         120           120
Survival Rates:
Treated = 150/270 = 55.56%
Non-Tr = 100/220 = 45.45%
22% Increase!

Women:
              Treated? No   Treated? Yes
Died? No           60            30
Died? Yes         100            60
Survival Rates:
Treated = 30/90 = 33.33%
Non-Tr = 60/160 = 37.50%
11% Decrease! (the non-treated rate is 12.5% higher)

Men:
              Treated? No   Treated? Yes
Died? No           40           120
Died? Yes          20            60
Survival Rates:
Treated = 120/180 = 66.67%
Non-Tr = 40/60 = 66.67%
No Difference!
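The arithmetic behind this slide (pooling the two sexes reverses the conclusion) can be checked with a short awk sketch. The function name `simpson_rates` and the input layout are assumptions made for illustration; the counts are the ones on the slide.

```shell
# Sketch: survival rates per stratum and pooled, from counts.
# simpson_rates is a hypothetical helper name.
simpson_rates() {
  # stdin: one line per stratum and treatment arm:
  #   stratum  treated(yes|no)  survived  died
  awk '
    function rate(s, d) { return 100 * s / (s + d) }
    {
      surv[$1, $2] = $3; died[$1, $2] = $4
      tot_surv[$2] += $3; tot_died[$2] += $4
      if (!($1 in seen)) { seen[$1] = 1; order[++k] = $1 }
    }
    END {
      # output columns: stratum, % survival treated, % survival non-treated
      for (i = 1; i <= k; i++) {
        g = order[i]
        printf "%s %.2f %.2f\n", g, rate(surv[g, "yes"], died[g, "yes"]), rate(surv[g, "no"], died[g, "no"])
      }
      printf "pooled %.2f %.2f\n", rate(tot_surv["yes"], tot_died["yes"]), rate(tot_surv["no"], tot_died["no"])
    }'
}

simpson_rates <<'EOF'
women yes 30 60
women no 60 100
men yes 120 60
men no 40 20
EOF
# prints:
# women 33.33 37.50
# men 66.67 66.67
# pooled 55.56 45.45
```

Treatment looks beneficial in the pooled line only because most treated patients are men, who survive at a higher rate regardless of treatment.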
Human Dependent
Challenges
Balance
•Too many Biochemists:
[Figure: scatter plot of y against x showing two clusters of points, annotated r = 0.87, r = 0.00 and r = 0.00]
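One reading of this plot, assumed here because only the r values survive, is that two groups with no correlation inside either group produce a strong correlation once pooled. The sketch below reproduces that effect on made-up points; `pearson_r` is a hypothetical helper name and the data are not the slide's.

```shell
# Sketch: Pearson correlation within two groups and pooled.
# pearson_r is a hypothetical helper; the points are toy data.
pearson_r() {
  # stdin: one "x y" pair per line
  awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      num = n * sxy - sx * sy
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      printf "%.2f\n", num / den
    }'
}

group1="0 0
0 1
1 0
1 1"
group2="5 5
5 6
6 5
6 6"

echo "group 1: r = $(printf '%s\n' "$group1" | pearson_r)"
echo "group 2: r = $(printf '%s\n' "$group2" | pearson_r)"
echo "pooled:  r = $(printf '%s\n%s\n' "$group1" "$group2" | pearson_r)"
# prints r = 0.00, r = 0.00 and r = 0.96 for these toy points
```

The pooled correlation reflects only the distance between the two group means, not any relationship between x and y within a group.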
Human Dependent
Challenges
Interdisciplinary Skills
Minimal knowledge of the application discipline is needed
… failing that, the Statisticians will win, … but with the wrong weapons.
1. Amount of Expression = Amount of Response
2. Same cut-off point to judge all genes
3. Over-emphasis on normalization (thus, reject “Boutique Arrays”)
4. Over-emphasis on variance stabilization
Human Dependent
Challenges
Interdisciplinary Skills
Minimal knowledge of the application discipline is needed: “Animal Breeding & Genetics”
Ex.1: What’s a Steer?
Ex.2: Ralf Moser’s Data
[Figure: scatter plot with axes % Lung Disease and Wt Gain, Kg]
Options:
1. % Gain vs. % Disease
2. Medians instead of Means
3. Regression coefficients*
Solutions
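[Figure: diagram of the four conditions O, A, B and AB; residual axis labels: Disease; Wt Gain, Kg]
O: Control (Untreated)
A: Treatment A
B: Treatment B
AB: Both Treatments
Model:
\[
O = \mu, \qquad A = \mu + \alpha, \qquad B = \mu + \beta, \qquad AB = \mu + \alpha + \beta + \alpha\beta
\]
The ratio:
\[
M = \mathrm{Log}\frac{R}{G}, \qquad M_{A.AB} = \mathrm{Log}(R_A) - \mathrm{Log}(G_{AB})
\]
estimates
\[
A - AB = -(\beta + \alpha\beta)
\]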
Solutions
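\[
\mathbf{M} = \mathbf{X}\boldsymbol{\beta} + \mathbf{E}
\]
\[
\begin{pmatrix}
M_{A.O} \\ M_{O.A} \\ M_{B.O} \\ M_{O.B} \\ M_{AB.O} \\ M_{O.AB} \\
M_{B.A} \\ M_{A.B} \\ M_{AB.A} \\ M_{A.AB} \\ M_{AB.B} \\ M_{B.AB}
\end{pmatrix}
=
\begin{pmatrix}
 1 &  0 &  0 \\
-1 &  0 &  0 \\
 0 &  1 &  0 \\
 0 & -1 &  0 \\
 1 &  1 &  1 \\
-1 & -1 & -1 \\
-1 &  1 &  0 \\
 1 & -1 &  0 \\
 0 &  1 &  1 \\
 0 & -1 & -1 \\
 1 &  0 &  1 \\
-1 &  0 & -1
\end{pmatrix}
\begin{pmatrix}
\alpha \\ \beta \\ \alpha\beta
\end{pmatrix}
+ \text{Error}
\]
(The signs follow from M with subscript X.Y estimating X - Y under the model O = mu, A = mu + alpha, B = mu + beta, AB = mu + alpha + beta + alpha.beta.)
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{M}
\]
[Figure: design diagram on the nodes O, A, B and AB]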
Solutions
[Figure: three designs on the nodes O, A, B and AB: Reference, Loop, All-Pairs]
Variance of Estimated Effects (Relative to the All-Pairs)
                    Reference    Loop    All-Pairs
Main effect of A        1         4/3        1
Main effect of B        1          1         1
Interaction AB          3         8/3        2
Contrast A-B            2          1         1
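The variances compared on this slide are the diagonal of the inverse of X'X from the least-squares estimator, in units of the error variance. The sketch below computes them for any design coded with one row per slide and three coefficients (main effect of A, main effect of B, interaction AB). The function name `design_variances`, the row coding, and the assumption that every design uses 12 slides are mine, not the slide's.

```shell
# Sketch: variances of the estimated effects for a two-colour design.
# design_variances is a hypothetical helper. Output is in units of sigma^2:
#   var(A)  var(B)  var(AB)  var(A - B)
design_variances() {
  # stdin: one slide per line, coefficients of alpha, beta and alpha.beta
  awk '
    # accumulate the 3 x 3 matrix X'"'"'X
    { for (i = 1; i <= 3; i++) for (j = 1; j <= 3; j++) s[i, j] += $i * $j }
    END {
      det = s[1,1] * (s[2,2] * s[3,3] - s[2,3] * s[3,2]) \
          - s[1,2] * (s[2,1] * s[3,3] - s[2,3] * s[3,1]) \
          + s[1,3] * (s[2,1] * s[3,2] - s[2,2] * s[3,1])
      # elements of the inverse obtained from cofactors
      v11 = (s[2,2] * s[3,3] - s[2,3] * s[3,2]) / det
      v22 = (s[1,1] * s[3,3] - s[1,3] * s[3,1]) / det
      v33 = (s[1,1] * s[2,2] - s[1,2] * s[2,1]) / det
      v12 = -(s[1,2] * s[3,3] - s[1,3] * s[3,2]) / det
      printf "%.4f %.4f %.4f %.4f\n", v11, v22, v33, v11 + v22 - 2 * v12
    }'
}

# All-Pairs design with 12 slides: each of the 6 pairs in both dye orientations
design_variances <<'EOF'
1 0 0
-1 0 0
0 1 0
0 -1 0
1 1 1
-1 -1 -1
-1 1 0
1 -1 0
0 1 1
0 -1 -1
1 0 1
-1 0 -1
EOF
# prints: 0.2500 0.2500 0.5000 0.2500
```

These values are in the ratio 1 : 1 : 2 : 1, matching the All-Pairs column. Under the same 12-slide assumption, a Reference design with four slides on each of the pairs A-O, B-O and AB-O gives 0.2500 0.2500 0.7500 0.5000, that is 1 : 1 : 3 : 2.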
Solutions
Probability of both Female?
Case 1. No information: 1/4
Case 2. The one on the left is female: 1/2
Case 3. At least one of them is female: 1/3
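The three answers can be verified by enumerating the four equally likely left/right combinations and keeping only those consistent with the available information. This is a small illustrative sketch; the function name `both_female` and the condition labels `none`, `left` and `any` are assumptions, and Case 3 is read as "at least one of them is female".

```shell
# Sketch: P(both female) by enumeration of equally likely outcomes.
# both_female is a hypothetical helper name.
both_female() {
  # $1: information available: none | left | any
  printf '%s\n' "F F" "F M" "M F" "M M" |
  awk -v cond="$1" '
    {
      # keep only the outcomes compatible with what is known
      ok = (cond == "none") || (cond == "left" && $1 == "F") || (cond == "any" && ($1 == "F" || $2 == "F"))
      if (ok) {
        kept++
        if ($1 == "F" && $2 == "F") hits++
      }
    }
    END { printf "%d/%d\n", hits, kept }'
}

echo "Case 1: $(both_female none)"   # 1/4
echo "Case 2: $(both_female left)"   # 1/2
echo "Case 3: $(both_female any)"    # 1/3
```

The answer changes only because the set of outcomes still possible changes, which is the point of the slide: the same data support different conclusions depending on the information that comes with them.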
Solutions
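\[
\mathbf{M} = \mathbf{X}\boldsymbol{\beta} + \mathbf{E}
\qquad\Longrightarrow\qquad
(\mathbf{X}^{T}\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}^{T}\mathbf{M}
\]
3 Equations
\[
\mathbf{M} = \mathbf{X}\boldsymbol{\beta} + \mathbf{S}\mathbf{s} + \mathbf{G}\mathbf{g} + \mathbf{W}\mathbf{w} + \mathbf{E}
\]
\[
\begin{pmatrix}
X'X & X'S & X'G & X'W \\
S'X & S'S & S'G & S'W \\
G'X & G'S & G'G & G'W \\
W'X & W'S & W'G & W'W
\end{pmatrix}
\begin{pmatrix}
\hat{\boldsymbol{\beta}} \\ \hat{s} \\ \hat{g} \\ \hat{w}
\end{pmatrix}
=
\begin{pmatrix}
X'M \\ S'M \\ G'M \\ W'M
\end{pmatrix}
\]
> 35,000 Equations!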
Solutions
Clever Programming Tailored to your needs
N=1
for filename in R16T0S1.gpr R16T0S2.gpr R16T24S1.gpr R16T24S2.gpr \
                S32T0S1.gpr S32T0S2.gpr S32T24S1.gpr S32T24S2.gpr
do
  # Get valid readings, compute log ratios
  # Output columns: spot ID, two background-corrected intensities, their log2 values
  awk 'NR>30 && $NF>=0 && $4!="no_spot" &&
       substr($4,1,5)!="score" && substr($4,1,5)!="custo" &&
       substr($4,1,6)!="spotre" && $9>$12 && $18>$21 {
         print $4, $9-$12, $18-$21,
               log($9-$12)/log(2.0), log($18-$21)/log(2.0)
       }' "$filename" | sort > junk1
  # Append the log ratio (column 6) and the average log intensity (column 7)
  awk '$2!=$3 {print $0, $4-$5, 0.5*($4+$5)}' junk1 > junk2
  # Get the median of the log ratios
  REC=`wc -l junk2 | awk '{print int($1/2)}'`
  MED=`sort -n -k6,6 junk2 | awk -v rec=$REC 'NR==rec {print $6}'`
  echo "Median of file" $filename " = " $MED
  # Global normalization: subtract the median from each log ratio
  awk -v median=$MED -v slide=$N \
      '{print "Slide_"slide, int(slide/2+.5), $1, $6-median}' junk2 |
      sort -k3 > dat.$N
  N=`expr $N + 1`
done
cat dat.1 dat.2 dat.3 dat.4 dat.5 dat.6 dat.7 dat.8 > total.dat
Solutions
Clever Programming Tailored to your needs
[Figure: Interaction; scatter plot of T24 - T0 with axes Resistant and Disease, both from -4 to 4]
Your Needs: “Important values are…”
1. Away from (0,0)
2. In quadrants 1 and 4.
Generate a new variable:
+1.0*[(R24-R0)+(S0-S24)] if R0<R24 & S0>S24
+0.5*[(R24-R0)+(S24-S0)] if R0<R24 & S0<S24
-0.5*[(R0-R24)+(S0-S24)] if R0>R24 & S0>S24
-1.0*[(R0-R24)+(S24-S0)] if R0>R24 & S0<S24
…then apply model-based clustering.
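The four rules above translate directly into a small awk function. This is a sketch only: the name `interaction_score`, the argument order (R0, R24, S0, S24) and the handling of ties (printing `NA`, since the slide's rules do not cover equal values) are assumptions.

```shell
# Sketch: the new variable defined by the four rules on the slide.
# interaction_score is a hypothetical helper name.
interaction_score() {
  # Arguments: R0 R24 S0 S24 (log intensities)
  awk -v r0="$1" -v r24="$2" -v s0="$3" -v s24="$4" 'BEGIN {
    dr = r24 - r0   # change in Resistant between T0 and T24
    ds = s24 - s0   # change in Susceptible between T0 and T24
    if      (dr > 0 && ds < 0) v =  1.0 * (dr - ds)    # R0<R24 & S0>S24
    else if (dr > 0 && ds > 0) v =  0.5 * (dr + ds)    # R0<R24 & S0<S24
    else if (dr < 0 && ds < 0) v = -0.5 * (-dr - ds)   # R0>R24 & S0>S24
    else if (dr < 0 && ds > 0) v = -1.0 * (-dr + ds)   # R0>R24 & S0<S24
    else { print "NA"; exit }                          # ties: not covered by the rules
    printf "%.2f\n", v
  }'
}

interaction_score 1 3 2 1   # up in Resistant, down in Susceptible: prints 3.00
interaction_score 3 1 1 2   # down in Resistant, up in Susceptible: prints -3.00
```

Genes that move in opposite directions in the two lines get full weight, genes that move in the same direction get half weight, and the sign records the direction of change in the Resistant line.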
Solutions
Clever Programming Tailored to your needs
[Figure: Differential Expression T24-T0; scatter plot with axes Resistant and Susceptible, both from -4 to 6]
Solutions
Clever Programming Tailored to your needs
Get to know/use all the available options
1. t-Statistics: Standard, Penalised
2. Clustering: Location-Based (k-Means, …), Model-Based (Mixtures of Distributions)
3. ANOVA (Linear Models)
[Figure labels: High, Medium, Low; Keren’s, Ralf’s]
Conclusions
Statistical Analysis of cDNA Microarray Data:
GENERAL:
1. Still in its infancy (…possibly even embryonic stage)
2. Many decisions have a heuristic rather than a theoretical foundation
3. No hope for a “One size fits all” software
4. Safer to aim towards “Tailor to one’s needs”
5. Integration of interdisciplinary skills is a must
LIVESTOCK SPECIES:
1. Tailing humans (…at the moment)
2. Strong background knowledge of genetics accumulated
3. Journals will soon be inundated
4. CLI has the opportunity to participate