Quantitative analysis of single gene effects Gregory Carey ...ibgcarey/QAOSGEdir/SingleGeneAnalysis.pdf · Analysis of single gene effects 1 Quantitative analysis of single gene effects

Analysis of single gene effects 1

Quantitative analysis of single gene effects

Gregory Carey, Barbara J. Bowers, Jeanne M. Wehner

From the Department of Psychology (GC, JMW) and Institute for Behavioral Genetic

(GC, BJB, JMW) University of Colorado, Boulder, CO, USA.

Supported by Grant 2 M01 RR00051 from the General Clinical Research Center Program

of the National Center for Research Resources, National Institute of Health, AA-11275,

AA-03627, and R.C.A. AA-00141 to JMW.

Abstract

Traditionally, the analysis of a quantitative phenotype as a function of a two allele

polymorphism at a single locus involves a oneway ANOVA with locus as the ANOVA

factor and the three individual genotypes (e.g., aa, Aa, and AA) as the levels of the factor.

Here, we propose a related method of analysis that is just as easy to implement as the

oneway ANOVA but has the following advantages: (1) greater statistical power; (2)

direct estimation of meaningful genetic parameters such as additive genetic variance at

the locus and both broad and narrow heritability due to the locus; (3) provides estimates

of effect size that also equal meaningful genetic parameters; (4) gives a direct test for

dominance that avoids the problems with post hoc ANOVA tests; (5) yields an omnibus

F test that is identical to that of the oneway ANOVA. The method is illustrated with data

on the protein kinase C g locus in the mouse and behavior on the elevated plus-maze.

Power calculations are also provided.

Many designs involving genetics and neuroscience involve the comparison of the

three genotypic means for a locus with various mechanisms used to control for (or to

estimate) differences in background genetic variation. A simple and valid statistical

technique for this type of problem is a oneway ANOVA in which genetic locus is the

ANOVA factor and the individual genotypes at the locus are the three levels of the factor.

Most planned experiments of this type strive to achieve equal sample sizes for the

three cells in the ANOVA. While the resulting orthogonal design has attractive

properties, it presents a problem for calculating quantities like narrow-sense heritability

for the locus that may be of use to the researcher. Many traditional texts on quantitative

genetics deal with panmictic populations and derive their formulas for heritability

assuming, say, Hardy-Weinberg equilibrium. When an ANOVA design has equal sample

sizes, then such formulas are inappropriate.

Here, we present a simple method of analysis for single-gene problems that has

several advantages over a oneway ANOVA but is just as easy to implement. The first

advantage is that is statistically more powerful than the oneway ANOVA. The second

advantage is that it calculates and tests the significance of quantitative parameters such as

broad and narrow sense heritability due to the locus. (Note well that in this paper, the

term “heritability” applies only to the variance explained by the locus under question and

not to all polymorphic loci that influence the trait). A third advantage is that results can

be presented in terms of effect sizes instead of p values, a trend that is recommended in

contemporary statistics (Cohen, 1988). A fourth advantage is that, in addition to the

information given above, it also provides an F test identical to that of a oneway ANOVA.

A fifth advantage is that there is no need for post hoc tests to assess for the presence of

dominant gene action.

Analytical proof of these statements (and other assertions to follow) is provided

by a technical document at http://ibgwww.colorado.edu/~carey/QAOSGE.html, and in

some standard statistical texts such as Judd and McClelland (1989). The model also

applies to observational data gathered on organisms in their natural habitat where

genotypic frequencies are not equal (see the above website for the general quantitative

equations). In this paper, however, we deal with the specific case of equal genotypic

frequencies.

Mathematical Model

The model used here assumes two alleles per locus (which we designate as A and

a), giving three genotypes—AA, Aa, and aa. The overall notation is similar to that used

in Falconer and Mackay (1996) and is presented in Table 1.

[Insert Table 1 about here]

Here m is the midpoint between the two homozygotes. The quantity a is the

additive genetic effect and has the following interpretation: if we were to substitute allele

A for allele a in a genotype, then we expect, on average, a phenotypic change of a units.

The quantity d is the parameter for dominance. It measures the extent to which the mean

of the heterozygote Aa deviates from the average of the two homozygotes. When d = 0,

there is complete additivity.

Data Arrangement

The arrangement of data for the analysis is illustrated in Table 2. In addition to a

column for genotype and one for phenotypic scores (the values of which are fictitious in

this table), two new quantitative variables are created. The first of these is called “alpha”

in Table 2 and it is used to obtain an estimate of the additive effect of an allelic

substitution and also an estimate of narrow-sense (i.e., additive) heritability. The rule for

constructing variable alpha is simple. Alpha equals –1 for one homozygote, equals 1 for

the other homozygote, and equals 0 for the heterozygote. (Hint: let alpha equal –1 for the

homozygote with the lower mean.) The second variable is called “delta” and it is used to

assess the presence of dominance. If the genotype is a heterozygote, then the value of

delta is 1; otherwise, delta equals 0.

Sequence of Analysis

After variables alpha and delta are constructed, two regressions are performed.

The first regression, referred to by some authors as the compact model (Judd &

McClelland, 1989), uses the phenotypic score as the dependent variable and variable

alpha as the independent variable. The second, termed the augmented model, uses both

variables alpha and delta as the independent variables.

Interpretation of the Output

The squared multiple correlation (R2) from the first regression is an estimate of

the narrow sense heritability or the proportion of additive genetic variance to phenotypic

variance for the locus. Multiplying this R2 by the phenotypic variance gives the additive

genetic variance that this locus contributes to the trait. If there is no dominance

(discussed below), then the regression coefficient for the variable alpha is a direct

estimate of the additive effect of an allelic substitution (i.e., the quantity a in Table 1).

Also, the test of significant for this coefficient is always the most powerful statistical test

for genetic effects provided dominance is not strong.

The regression for the augmented model tests for dominance. If we reject the null

hypothesis of no dominance, then this regression gives additional important quantities;

otherwise, we return to the first regression and present and interpret those results. The

intercept from this multiple regression model equals the midpoint between the two

homozygotes (i.e., the quantity m in Table 1). The regression coefficient for variable

alpha equals the additive effect of an allelic substitution (i.e., the quantity a in Table 1).

The regression coefficient for variable delta equals the deviation of the heterozygote

mean from the midpoint (i.e., the quantity d in Table 1).

The F statistic from augmented model is an omnibus F that assesses the fit of the

whole model. It, along with its p value, will be identical to the F (and that F’s p value)

from a oneway ANOVA. For the types of sample sizes available for neuroscience

research, the t test for the regression coefficient of variable alpha in either the first or

second regression is almost always a more powerful statistic for testing the presence of

genetic effects at the locus than the omnibus F. For those few cases in which the F test is

more powerful (complete dominance, large additive effect, and large sample sizes), the

maximal difference in power is about .03. However, in the rest of the parameter space,

the difference in power favoring the t statistic can be appreciable—up to .20.

The appropriate test for dominance is the significance of the t statistic for the

regression coefficient of variable delta. The p level for this t statistic will be identical to

that of an F statistic that tests whether adding variable delta significantly increased R2

over the first regression. This test for dominance will always be less powerful than the t

test for variable alpha’s regression coefficient.

The squared multiple correlation from the second regression is an estimate of

broad sense heritability or the proportion of phenotypic variance attributable to total

genetic variance (additive plus dominance variance) at the gene. Thus, the proportion of

phenotype variance attributable to dominance variance can be calculated by subtracting

the R2 from this regression from the R2 of the first regression. Multiplying this quantity

by the phenotypic standard deviation gives dominance variance in raw score units.

A numerical example

As an example, we analyze data collected and reported by Bowers et al. (2000) on

behavior on an elevated plus-maze for mice lacking the gene for the g isoform of protein

kinase C (PKCg knockouts) and their heterozygous and wild-type littermates. Two

phenotypes are analyzed, both derived from a principal components analysis of the

original variables presented in Table 1 of Bowers et al. (2000). These factors agree

almost perfectly with those reported by Rodgers & Johnson (1995) using a different

population of mice. The first phenotype is activity (marked by the total number of

entrances, and entrances into the closed arms of the maze). The second phenotype is

anxiety. Here, high scores are marked by a high percentage of time and of entrances into

the closed arm while low scores are indexed by a high percentage of time and entrances

into the open arms. Descriptive statistics for these two phenotypes are presented in Table

The values of alpha were assigned so that the PKCg knock out mice were given

the value of –1, heterozygotes a value of 0, and the wild type homozygotes, a value of 1.

Code from the Statistical Analysis System (SAS) to perform the regressions for the the

anxiety phenotype is given in Appendix 1.

The activity phenotype illustrates how the method operates for a system with only

additive gene action. The results from the first regression, presented in Table 4, should

be used to interpret whether or not the PKCg locus has an effect on activity. Here, one

could interpret either the F statistic from the ANOVA table or the t statistic testing

whether the parameter estimate for variable alpha is significantly different from 0. (Both

statistics are equivalent because with only one independent variable the F statistic is the

square of the t statistic and both will have identical p values.) Here, t = -2.17 (p = .037),

so we reject the null hypothesis of no genetic effect and conclude that the PKCg gene has

an influence on overall activity in the elevated plus-maze. The value of the coefficient

for variable alpha (i.e., our estimate of a) is -.42 indicating that, on average, a

substitution of one wild type allele for null (i.e., knock out) allele reduces activity by .42

units. Here, the estimate of a may be viewed as an effect size expressed in the metric of

the original data.

This effect size may be standardized in one of two ways. First, the estimate of

a may be divided by the error standard deviation (i.e., the square root of the error mean

square). This gives a measure of effect size favored by statisticians such as Cohen

(1988). In the present case, this gives

-.42 / .90 = -.44. Hence, the average effect of an

allelic substitution is to change activity by .44 standard deviation units. The second way

to standardize is to divide a by the phenotypic standard deviation. Because we used

scores from a principal components analysis, the phenotypic standard deviations are 1.0,

leaving a unchanged.

A second way of expressing effect size is in terms of the proportion of variance

explained. The statistic here is R2, the squared multiple correlation that will be calculated

and printed in any output. It just so happens that this quantity also equals narrow-sense

heritability or the ratio of additive genetic variance to phenotypic variance. For this

regression, the R2 is .12, implying that 12% of phenotypic variance is attributable to

additive genetic variance.

The next step is to test for dominance by regressing the activity phenotype on

both variables alpha and delta. Results are presented in Table 5. The critical statistic is

the t statistic that tests whether the parameter estimate for delta is significantly difference

from 0. Here, the value of t is .08 and its associated p value is .937. Hence, there is no

evidence for dominance on activity, and we would return and interpret the first regression

as the best model to explain the data.

The results on activity also illustrate how the present method increases power to

detect genetic effects over a oneway ANOVA. The ANOVA table from a oneway

ANOVA is identical to that presented in Table 5. Here, the F value is 2.28 and its p

value is .118, so one would not reject the null hypothesis. The typical conclusion would

be that there is no evidence that the PKCg locus influences activity in the elevated plus-

maze. On the other hand, we have seen that testing whether the coefficient for alpha

differs from 0 results in a statistically significant finding.

In summary, there is good evidence that the PKCg locus influences activity. All

gene action appears to be additive and the estimate of both narrow and broad-sense

heritability for the locus is .12. Whether an effect size of this magnitude is something

that is worthwhile pursuing is, of course, a matter that should be determined by the

substance of the problem and not the statistics.

The anxiety phenotype is used to illustrate the method when dominance is

important. The results of regressing anxiety on variable alpha are presented in Table 6.

Here, the t statistic testing whether the coefficient for alpha equals is 3.09 (p = .004), so

we conclude that there is evidence that the PKCg locus also influences anxiety. The R2

from this regression (.22) equals the estimate of narrow sense heritability for anxiety.

The results of the second regression are given in Table 7. The t statistic for the

coefficient for variable delta equals 3.25 (p = .003), so we reject the null hypothesis of no

dominance. Because there is evidence for dominance, we favor the parameter estimate

for alpha using this regression instead of that from the first regression. (The two

estimates are the same in the present case because there are equal sample sizes for the

genotypes; when sample sizes differ, the estimates of alpha may be different in the two

regressions). Here, the value of the coefficient for variable alpha is .56, implying that

substituting one wild-type allele for a null PKCg allele has the average effect of

increasing anxiety by .56 units. The value of the coefficient for variable delta (i.e., our

estimate of d) is .91. This is larger than the value for a, so we might suspect heterosis.

Let us postpone discussion of this topic to focus on interpretation of heritability.

The R2 for this second regression is .41. This is our estimate of broad sense

heritability for anxiety. In short, variation in the PKCg locus accounts for about 41% of

the variability in this anxiety measure in this population of mice. Because the R2 from the

first regression is an estimate of narrow-sense heritability, the contribution of dominance

to broad sense heritability may be found by subtracting the R2 from the first regression

from that in the second regression. This gives .41 - .22 = .19.

Should we interpret the large estimate of d as overdominance? Certainly the

mean of the heterozygote is consistent with this possibility. Most—but not all—modern

regression software allows for a direct test of this hypothesis. When there is heterosis,

then the value for d should be significantly greater than the value of a (or significantly

less than the value of –a, depending on which allele is dominant). One can test for this

by constraining the regression coefficients for variables alpha and delta to be equal and

then testing the significance of this model against the second regression given above.

One simple SAS statement is sufficient for this test (see Appendix 1).

For the present case, the test is not significant (F(1,33) = 1.14, p = .29). Hence,

the value of d is not significantly greater than that for a, and there is no evidence for

overdominance.

Calculations of power are very easy when the number of animals in each cell is

equal. Because the most powerful test is the t statistic for the regression coefficient of

variable alpha, we base power on that statistic used in the augmented model. The

numerator for the t is the hypothesized value of the regression coefficient for variable

alpha (i.e., the quantity a in the model in Table 1). The denominator is the standard error

of the regression coefficient and equals

1.5s error2

s error2 is the error variance and N is the total number of animals. The degrees of

freedom for this test equal (N – 2). With a little algebra, the noncentrality parameter for

the t may be written as

tnc =a

s error

Note that the a in this equation is the parameter on Table 1 and not the probability of a

type I error. Note also the quantity a divided by the error standard deviation is the

Cohen’s standardized effect size. Power may then be calculated by finding the area

under this noncentral t distribution that exceeds the critical value of t. A program for the

calculation of power is available at http://ibgwww.colorado.edu/~carey/QAOSGE.html.

Table 8 gives estimates of power as a function of Cohen’s (1988) effect size and

sample sizes. The calculations were made assuming complete additivity. Dominance

will increase power because error variance will be reduced, but increase is so slight that

the figures in Table 8 are close approximations, even to the case of complete dominance.

The calculations suggest that 12 animals per group will result in power of .80 or better as

long as the effect size is moderately large (i.e., .60 or greater).

Discussion

The method advocated in this paper provides greater statistical power and more

informative quantitative estimates than traditional statistical techniques for analyzing

single gene effects. It provides two different ways to communicate effect sizes, one

expressed in mean change units (i.e., the values of a and d) and the second in terms of

variance components or heritabilities (R2) at the locus.

The major liability with the technique does not lie in the mathematics. Instead, it

resides in the incorrect interpretation and extrapolation of the quantitative estimates. In

experimental organisms, heritability estimates for a single gene are usually estimated

within the context of a limited range of genetic backgrounds and in controlled

environments. These estimates should never be extrapolated to general populations or

treated and interpreted as heritability estimates derived from, say, human twin or

adoption studies. For example, the mice used for the data examples give involved the

genetic background of C57BL/6J and 129/SvEvTac inbreds and were all reared and

tested in environmentally standardized conditions. The estimates of effect size and

heritability pertain only to this genetic background and similar environments. These

estimates can be very useful in guiding which of several future experiments to pursue, in

calculating power for other studies, and perhaps even in pointing out a suitable candidate

locus for other designs. They do not, however, imply that the PKCg locus must be

involved in individual differences in anxiety in wild mouse populations.

Finally, we offer a caveat. All statements made in this paper about power are

based on equal genotypic frequencies. They will be also apply to the case where cell size

differs, but only by a few animals. The results on power should not be extrapolated to

natural populations in whom allele and genotypic frequencies can markedly differ.

References

Bowers,B.J., Collins,A.C., Tritto,T., & Wehner, J.M. (2000). Mice lacking PKCg exhibit

decreased anxiety. Behavior Genetics 30:111-121.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale NJ: L.

Erlbaum.

Falconer,D.S. & Mackay, T.F.C. (1996). Introduction to quantitative genetics, 4th edition.

Essex, England: Longman.

Judd,C.M. & McClelland,G.H. (1989) Data analysis: A model-comparison approach. San

Diego: Harcourt Brace Jovanovich.

Rodgers, R.J. & Johnson, N.J.T. (1995) Factor analysis of spatiotemporal and ethological

measures in the murine elevated plus-maze test of anxiety. Pharmacology Biochemistry

and Behavior 52: 297-303.

Table 1. Notation for the single-gene model.

Genotype Frequency

Expected

Variance

within Genotype

aa faa m – a s

Aa fAa m + d s

AA fAA m + a s

Table 2. Example of a data set arranged for single-gene analysis.

Phenotype Genotype alpha delta

8.3 AA 1 0

6.1 Aa 0 1

5.7 aa -1 0

6.5 aa -1 0

7.2 Aa 0 1

6.9 AA 1 0

Table 3. Means and standard deviations for activity and anxiety measures on an elevated

plus-maze for mice lacking the gene for PKCg (knock outs) and their heterozygous and

wild-type littermates.

Activity Anxiety

Genotype N Mean St. Dev. Mean St. Dev.

Knock Out 12 .41 .94 -.87 .93

Heterozygote 12 .02 1.09 .61 .56

Wild Type 12 -.43 .85 .26 .85

Table 4. Results on a activity from the regression model with independent variable alpha.

Source df

Sum of

Squares

Squares F p

Model 1 4.24 4.24 4.69 .037

Error 34 30.76 .90

Total 35 35.00

Parameter Value

Standard

Error t p

alpha -.42 .19 -2.17 .037

Table 5. Results on activity from the regression model with independent variables alpha

and delta.

Source df

Sum of

Squares

Squares F p

Model 2 4.25 2.12 2.28 .118

Error 33 30.75 .93

Total 35 35.00

Parameter Value

Standard

Error t p

alpha -.42 .20 -2.13 .040

delta .03 .34 .08 .937

Table 6. Results on anxiety from the regression using independent variable alpha.

Source df

Sum of

Squares

Squares F p

Model 1 7.66 7.66 9.52 .004

Error 34 27.34 .80

Total 35 35.00

Parameter Value

Standard

Error t p

alpha .57 .18 3.09 .004

Table 7. Results on anxiety from the regression using independent variables alpha and

delta.

Source df

Sum of

Squares

Squares F p

Model 2 14.28 7.14 11.38 .0002

Error 33 20.72 .63

Total 35 35.00

Parameter Value

Standard

Error t p

alpha .56 .16 3.49 .001

delta .91 .28 3.25 .003

Table 8. Power for single-gene analysis using the t test for the regression coefficient of

variable alpha in the full model: No dominance.

Effect Size = a/serror

Total N .2 .4 .6 .8 1.0

12 0.07 0.18 0.37 0.53 0.72

18 0.10 0.26 0.50 0.74 0.90

24 0.12 0.33 0.63 0.86 0.97

30 0.14 0.41 0.74 0.93 0.99

36 0.16 0.48 0.82 0.97 0.99

42 0.18 0.54 0.87 0.99 0.99

48 0.20 0.60 0.94 0.99 0.99

54 0.22 0.65 0.94 0.99 0.99

60 0.24 0.70 0.96 0.99 0.99

Appendix 1. SAS code for performing the regression analyses. Code for variable

genotype: 1 = PKCg null mutant homozygotes; 2 = heterozygotes; 3 = wild-type

homozygotes. A copy of this program is available at

http://ibgwww.colorado.edu/~carey/QAOSGE.html.

data plusmaze;

input genotype activity anxiety;

if genotype=1 then alpha = -1;

else if genotype=2 then alpha = 0;

else alpha = 1;

delta = 0;

if genotype=2 then delta = 1;

datalines;

1 0.83 -0.889

1 0.274 -0.816

1 0.869 0.025

1 0.898 -2.024

1 0.257 -1.346

1 -0.429 -1.553

1 1.061 -2.557

1 -2.093 -0.573

1 1.56 0.919

1 0.976 -0.213

1 0.497 -0.693

1 0.236 -0.699

2 -0.121 -0.174

2 -3.047 -0.396

2 0.345 0.789

2 0.739 1.234

2 0.164 1.21

2 0.561 1.118

2 0.852 -0.106

2 0.868 0.69

2 0.179 0.738

2 -0.936 0.499

2 -0.046 1.035

2 0.658 0.644

3 -0.639 1.367

3 -0.329 0.344

3 -0.746 1.32

3 -0.879 -1.053

3 0.375 0.61

3 0.667 -0.001

3 -2.484 -0.563

3 -0.479 0.144

3 -0.943 1.262

3 -0.098 -1.08

3 -0.185 0.576

3 0.584 0.212

title Bowers et al (2000) data on PKC-gamma and anxiety;

proc reg data=plusmaze;

var anxiety alpha delta;

title2 first model;

additive: model anxiety = alpha;

title2 second model;

total: model anxiety = alpha delta;

title3 test for overdominance;

test alpha = delta;

Quantitative analysis of single gene effects Gregory Carey ...ibgcarey/QAOSGEdir/SingleGeneAnalysis.pdf · Analysis of single gene effects 1 Quantitative analysis of single gene effects

Documents

Gene Jury Single Gene Cross. Living thing! Wee baldy sheep.....

Gene therapy for common single gene...

Mutations Drosophila gene: Effects - PNAS · 2005. 4....

A single gene network accurately predicts phenotypic effects...

Fluidigm Based Single-cell Gene Expression Library ...

Using R and Bioconductor to explore genetic effects on...

Molecular pathology of single gene disorders

A Single dicer Gene Is Required for Efﬁcient Gene...

Preimplantation genetic diagnosis for single gene disorders:...

CHAPTER 2 Single-Gene Inheritance

Single Gene Slide

Single gene disorders oct 2012 part 2

Single gene disorders basic concept nov 2012

Single Cell V8...10x Genomics gene expression workﬂow...

Single Gene

Single Gene Disorders