Finding associated genes in large collections of microarrays
Dec 22, 2015
Produce hypotheses about functional relations between genes
• Positive correlation: Co-regulated genes or positive modulator
• Negative correlation: Co-regulated genes or inhibitor.
• Used to derive networks of gene interactions.
4 simple ways of finding association
• Pearson correlation coefficient.
• Spearman’s rank correlation coefficient.
• Probabilistic approach (Present/Absent).
• Mutual information (Present/Absent)
Pearson correlation coefficient
• Varies between -1 and 1:
– Between 0.6 and 1: strong positive correlation.
– Between -0.6 and -1: strong negative correlation.
– A value of -1 is a perfect negative correlation.
– A value of 1 is a perfect positive correlation.
• Assumes linear relation between variables.
Pearson correlation coefficient
• Step 1: Prepare data.
• Step 2: Compute Pearson coefficient between pairs of probes of interest.
• Step 3: Assess significance.
• Step 4: Multiple testing correction.
Pearson correlation coefficient
• Step 1: Prepare data:
– Chips are normalized with MAS 5.0 or another procedure.
– Scale the probes in each chip by dividing by the chip mean.
– Center and standardize each probe distribution: z-scores.
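The preparation steps above can be sketched as follows. This is a minimal illustration, assuming the chips are already normalized and that z-scores use the sample standard deviation (n-1); the function names and the toy values are hypothetical.

```python
from statistics import mean, stdev

def scale_chip(chip):
    """Divide every probe value on one chip by that chip's mean."""
    m = mean(chip)
    return [v / m for v in chip]

def zscores(values):
    """Center and standardize one probe across chips (sample std, n-1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical toy data: 3 chips (rows) x 2 probes (columns).
chips = [[100.0, 300.0], [150.0, 250.0], [50.0, 350.0]]
scaled = [scale_chip(c) for c in chips]

# z-scores of the first probe across the three chips.
probe0 = zscores([c[0] for c in scaled])
```

After this step every probe has mean 0 and standard deviation 1 across chips, which is what the z-score form of the Pearson coefficient expects.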
Pearson correlation coefficient
• Step 2: Compute Pearson coefficient between pairs of probes, when z-scores are pre-computed:
$$\rho = \frac{1}{n-1} \sum_{i=1}^{n} z_{x,i}\, z_{y,i}$$
z_x, z_y: z-scores of probes x and y
n: number of chips
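As a sketch of this step, the coefficient can be computed directly from pre-computed z-scores. The code assumes the z-scores were standardized with the sample standard deviation, so the sum of products is divided by n-1; the function names are illustrative only.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize one probe across chips (sample std, n-1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def pearson_from_z(zx, zy):
    """Pearson coefficient from two z-score vectors of equal length."""
    n = len(zx)
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical probes measured on 5 chips.
rho_up = pearson_from_z(zscores([1, 2, 3, 4, 5]), zscores([2, 4, 6, 8, 10]))
rho_down = pearson_from_z(zscores([1, 2, 3, 4, 5]), zscores([10, 8, 6, 4, 2]))
```

Pre-computing the z-scores once per probe means each pairwise coefficient costs only a single dot product.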
Pearson correlation coefficient
• Step 3: Assess significance:
– Randomize if possible. Good for fewer than 20 chips, or
– Use Student's t distribution with n-2 degrees of freedom:
$$t = \rho \sqrt{\frac{n-2}{1-\rho^2}}$$
ρ: correlation coefficient
n: number of chips
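Both options can be sketched in a few lines. The randomization here is assumed to be a permutation test that shuffles one probe's values across chips; the number of permutations, the seed and the toy data are hypothetical choices. Converting t into a p-value needs the t distribution with n-2 degrees of freedom, which is left out because the standard library does not provide it.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(rho, n):
    """t value to compare against a t distribution with n-2 d.o.f."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |rho| >= observed |rho|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [0.1, 1.2, 1.9, 3.3, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0]
p_perm = permutation_pvalue(x, y)
t_val = t_statistic(pearson(x, y), len(x))
```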
Pearson correlation coefficient
• Step 4: Multiple testing correction
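The slides do not name a correction method. As an assumption, the sketch below shows two common choices, Bonferroni and Benjamini-Hochberg, applied to a hypothetical list of p-values; other procedures may be more appropriate for a given collection.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

Bonferroni is the more conservative of the two; with thousands of probe pairs it can leave very few significant associations.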
Spearman’s rank correlation coefficient
• Non-parametric method:
– Less power but more robust.
– Does not assume normal distribution.
• Also varies between -1 and 1
Spearman’s rank correlation coefficient
• Step 1: Prepare data.
• Step 2: Compute Spearman’s rank correlation coefficient between probe of interest and the rest.
• Step 3: Assess significance.
• Step 4: Multiple test correction.
Spearman’s rank correlation coefficient
• Step 1: Prepare data:
– Same as Pearson.
– Order the values of the probes by increasing hybridization values.
– Construct the rank vectors.
Spearman’s rank correlation coefficient
• Step 2: Compute coefficient between probe sets of interest:
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
d: differences between the ranks of the two probes
n: number of chips
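A minimal sketch of the rank construction and the coefficient follows. It assumes there are no tied values within a probe, since the rank-difference formula is exact only in that case; the function names and the toy values are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rank correlation from the squared rank differences."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical probes on 5 chips: monotone but non-linear relation.
rho_monotone = spearman([10, 20, 30, 40, 50], [1, 4, 9, 16, 25])
rho_reversed = spearman([10, 20, 30, 40, 50], [5, 4, 3, 2, 1])
rho_partial = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

The first pair shows why the method is robust: the relation is not linear, yet the ranks agree perfectly.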
Spearman’s rank correlation coefficient
• Step 3: Assess significance: Same as Pearson.
– Randomize if possible. Good for fewer than 20 chips, or
– Use Student's t distribution with n-2 degrees of freedom:
$$t = \rho \sqrt{\frac{n-2}{1-\rho^2}}$$
ρ: correlation coefficient
n: number of chips
Spearman’s rank correlation coefficient
• Step 4: Multiple testing correction.
Binary probabilistic approach based on Present/Absent
• Approach adapted from:
“Computational methods for the identification of differential and coordinated gene expression.”
Claverie JM. Hum Mol Genet. 1999;8(10):1821-32.
• Use MAS 5.0 calls of Present-Marginal-Absent for each probe.
• Good for heterogeneous microarray collections.
Binary approach based on Present/Absent
• Step 1: Prepare data.
• Step 2: Compute p-value of # of observed matches.
• Step 3: Multiple test correction.
Binary approach based on Present/Absent
• Step 1: Obtain P/M/A calls for probes:
– Each call is associated with a p-value. A filter can be applied.
– Codify P/M/A calls as binary vectors: encode P as 1 and M/A as 0.
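The encoding can be sketched as below. The function name and the `marginal_as_present` flag are hypothetical; the default follows the rule stated here (P as 1, M/A as 0).

```python
def encode_calls(calls, marginal_as_present=False):
    """Turn a list of 'P'/'M'/'A' detection calls into a 0/1 vector."""
    present = {"P", "M"} if marginal_as_present else {"P"}
    return [1 if call in present else 0 for call in calls]

# Hypothetical calls for one probe across 6 chips.
calls = ["P", "P", "A", "M", "A", "P"]
binary = encode_calls(calls)
```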
Binary approach based on Present/Absent
• Step 2: Compute p-value of # of matches
probe x: 1 1 0 0 0 1 1 0 1 0 0 0
probe y: 1 1 0 0 0 0 1 0 1 0 0 0
probe z: 0 0 1 1 1 1 0 0 0 1 1 1
Find an improbably high number of matches (or mismatches).
probe x & y: 11 out of 12 matches
probe x & z: 10 out of 12 mismatches
Binary approach based on Present/Absent
• Step 2: Compute the probability of observing by chance the observed number of matches or more, from the binomial distribution B(n,p). First, the probability of a match:
$$p_{match} = p_x p_y + (1 - p_x)(1 - p_y)$$
p_x: fraction of 1s (Present) in probe x.
p_y: fraction of 1s (Present) in probe y.
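As a sketch, the match count and the chance probability of a match can be computed for the probe x and probe y vectors shown earlier. The function names are illustrative; the match probability assumes the two probes are independent.

```python
def count_matches(a, b):
    """Number of chips where both probes have the same 0/1 call."""
    return sum(1 for u, v in zip(a, b) if u == v)

def match_probability(a, b):
    """Chance of a match on one chip if the two probes were independent."""
    pa = sum(a) / len(a)  # fraction of 1s (Present) in the first probe
    pb = sum(b) / len(b)  # fraction of 1s (Present) in the second probe
    return pa * pb + (1 - pa) * (1 - pb)

probe_x = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]
probe_y = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]

matches = count_matches(probe_x, probe_y)
p_match = match_probability(probe_x, probe_y)
```

Here probe x is Present on 5 of 12 chips and probe y on 4 of 12, so a match by chance has probability 76/144, a little over one half.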
Binary approach based on Present/Absent
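• Step 2: Compute the probability of observing by chance the observed number of matches or more, from the binomial distribution:
$$B(n, p_{match})$$
n: number of chips.
• For n large one can use the normal distribution, with mean and variance:
$$N\big(n\,p_{match},\; n\,p_{match}(1 - p_{match})\big)$$
provided that
$$n\,p_{match} \ge 5 \quad \text{and} \quad n(1 - p_{match}) \ge 5$$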
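Both tail probabilities can be sketched with the standard library. The continuity correction in the normal approximation is an added assumption, not something stated on the slide, and the function names are illustrative.

```python
import math

def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ B(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def normal_tail(k, n, p):
    """Normal approximation to P(X >= k), with continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# 11 or more matches out of 12 chips with a match probability of 76/144.
p_small_n = binomial_tail(11, 12, 76 / 144)

# Larger hypothetical collection: 60 or more matches out of 100 chips.
p_exact = binomial_tail(60, 100, 0.5)
p_approx = normal_tail(60, 100, 0.5)
```

With only 12 chips the exact binomial sum is cheap and the safer choice; the approximation matters for collections large enough that many pairs must be scored quickly.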
Binary approach based on Present/Absent
• Step 3: Multiple test correction.
Mutual information based on Present/Absent
• Step 1: Prepare data.
• Step 2: Compute MI value for pairs of probes.
• Step 3: Use of a threshold for MI
Mutual information based on Present/Absent
• Step 1: Obtain P/M/A calls for probes:
– Each call is associated with a p-value. A filter can be applied.
– Codify P/M/A calls as binary vectors, either:
  • Encode P/M as 1 and A as 0, or
  • Encode P as 1 and M/A as 0
Mutual information based on Present/Absent
• Step 2: Compute MI value for probes X and Y:
$$MI(X, Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
p(.) frequencies of observed Ps and As
p(x,y) frequencies of the joint distribution
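A minimal sketch of the MI computation on two binary call vectors follows, using the standard definition of mutual information with observed frequencies as probabilities. The natural logarithm is an assumption (the base only rescales the value), and empty cells of the joint table are skipped, following the usual 0 log 0 = 0 convention.

```python
import math

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length 0/1 vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = sum(1 for u, v in zip(x, y) if u == a and v == b) / n
            p_x = sum(1 for u in x if u == a) / n
            p_y = sum(1 for v in y if v == b) / n
            if p_xy > 0:  # empty cells contribute nothing
                mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Hypothetical probes on 4 chips.
mi_identical = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
mi_inverted = mutual_information([0, 1, 0, 1], [1, 0, 1, 0])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical and fully inverted probes give the same MI, so the measure captures both positive and negative association without distinguishing their sign.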
Mutual information based on Present/Absent
• Step 3: Use a threshold: probes X and Y are correlated if:
$$MI(X, Y) > \frac{1}{n} \log \frac{1}{P}$$
n: number of chips.
P: 1/p^2 (with p the number of probes).
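The threshold rule can be sketched as below. The function names and the collection size are hypothetical, and the natural logarithm is assumed, so the MI value being tested should be computed with the same base.

```python
import math

def mi_threshold(n_chips, n_probes):
    """Threshold (1/n) * log(1/P) with P = 1 / n_probes**2."""
    big_p = 1.0 / n_probes ** 2
    return math.log(1.0 / big_p) / n_chips

def is_correlated(mi_value, n_chips, n_probes):
    """True when the MI value exceeds the threshold for this collection."""
    return mi_value > mi_threshold(n_chips, n_probes)

# Hypothetical collection: 100 chips, 1000 probes.
threshold = mi_threshold(100, 1000)
```

Since log(1/P) equals 2 log p, the threshold grows slowly with the number of probes and shrinks as more chips are added.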
“A simple method for reverse engineering causal networks”
M. Andrecut and S. A. Kauffman
J. Phys. A: Math. Gen. 39 No 46.
Try Pearson method in Stembase!
Implemented by Reatha Sandie