10 Liu, Dajiang

Statistical Genetics Using Sequence Data

Dajiang J. Liu

Department of Statistics

Why We Study Statistical Genetics

• Statistics originated in genetics
• R.A. Fisher: “The Correlation Between Relatives on the Supposition of Mendelian Inheritance”
– Introduced the concept of variance in this article
• Francis Galton: regression of human height toward the mean
– Introduced correlation and regression
• Karl Pearson:
– “Mendelism and the problem of mental defect”
– “Tuberculosis, heredity and environment”

• Why don’t we seek our roots?

• In order to find disease genes in the genome, statistics is a must

Statistical Genetics

• Disease gene mapping:
– The determination of the sequence of genes and their relative distances from one another on a specific chromosome
– Technology-driven field:
1. Mendel’s era: segregation analysis
- Patience: peas, fruit flies; inbreeding is necessary

Experimental Design

Statistical Genetics

• Modern era:
– Microsatellite markers:
• Genetic linkage analysis
– Extremely successful for mapping and identifying Mendelian traits
– Single nucleotide polymorphism (SNP) markers:
• Case-control studies
– Genome-wide association studies (GWAS): identify common variants involved in complex traits

Computational techniques for likelihood in pedigrees

Statistics plays a major role
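As an illustration of the case-control comparison behind a single-SNP association test, here is a minimal sketch of an allelic chi-square test on a 2x2 table of allele counts. The function name `allelic_chi_square` and all counts are hypothetical, chosen only for illustration. Real GWAS analyses also deal with multiple testing, population structure, and genotype quality, none of which is shown here.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under no association between allele and status
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 1000 cases and 1000 controls, i.e. 2000 alleles each
chi2, p_value = allelic_chi_square(case_alt=600, case_ref=1400,
                                   ctrl_alt=500, ctrl_ref=1500)
print(f"chi2 = {chi2:.3f}, p = {p_value:.2e}")
```

In a genome-wide scan the same test would be repeated at every SNP, which is why the significance threshold must be far stricter than 0.05.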

Statistical Genetics

• Sequencing era:
• Study of diseases due to rare variants is emerging

ABI SOLiD sequencer

Statistics is essential for all sequencing data

Statistical Genetics

• Data we work with

Human Genome Project

HapMap Project

1000 Genomes Project

Multifactorial Disease Etiology Hypothesis

• Common Disease/Common Variants (CD/CV) hypothesis:
– Common diseases are caused by a few common variants with moderate effect
– E.g. age-related macular degeneration
• Common variants are likely to have lower odds ratios than rare variants
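To make the odds ratio comparison concrete, the sketch below computes an odds ratio and an approximate 95% confidence interval (Woolf's log-based method) from carrier counts in cases and controls. The function name `odds_ratio_with_ci` and both sets of counts are hypothetical and chosen only to show a modest-effect common variant next to a larger-effect rare variant. They are not data from any study.

```python
import math

def odds_ratio_with_ci(carrier_cases, noncarrier_cases,
                       carrier_controls, noncarrier_controls, z=1.96):
    """Odds ratio with an approximate 95% CI from a 2x2 table (Woolf method)."""
    odds_ratio = (carrier_cases * noncarrier_controls) / (
        noncarrier_cases * carrier_controls)
    # Standard error of log(OR): square root of the sum of reciprocal counts
    se_log_or = math.sqrt(1 / carrier_cases + 1 / noncarrier_cases
                          + 1 / carrier_controls + 1 / noncarrier_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical common variant: many carriers, modest effect
common = odds_ratio_with_ci(600, 400, 500, 500)
# Hypothetical rare variant: few carriers, larger effect
rare = odds_ratio_with_ci(30, 970, 10, 990)

for name, (or_, lo, hi) in (("common", common), ("rare", rare)):
    print(f"{name}: OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With so few carriers, the rare-variant estimate comes with a much wider interval, which is part of why rare variants are harder to study one at a time.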

Multifactorial Disease Etiology Hypothesis

• Common Disease/Rare Variants hypothesis:
– Common diseases are caused by multiple rare variants with large effect sizes
– The discovery of rare variants will have high impact on public health since they will aid in risk prediction and treatment

• E.g. Multiple Rare Alleles Contribute to Low Plasma Levels of HDL Cholesterol

• E.g. Colorectal Adenomas
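Because each rare variant is carried by very few individuals, one generic way to study multiple rare variants jointly is to collapse them within a gene into a single carrier indicator and compare carrier frequencies between cases and controls. The sketch below is a simple collapsing-style test with a permutation p-value. It is an illustrative assumption, not the method presented in these slides, and the function names and toy genotypes are hypothetical.

```python
import random

def carrier_status(genotypes):
    """1 if an individual carries at least one rare allele at any site, else 0."""
    return [1 if any(g > 0 for g in person) else 0 for person in genotypes]

def collapsing_permutation_test(genotypes, is_case, n_perm=2000, seed=0):
    """One-sided permutation test for excess rare-variant carriers among cases."""
    carriers = carrier_status(genotypes)
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case

    def statistic(labels):
        case_carriers = sum(c for c, y in zip(carriers, labels) if y == 1)
        ctrl_carriers = sum(c for c, y in zip(carriers, labels) if y == 0)
        # Difference in carrier frequency: cases minus controls
        return case_carriers / n_case - ctrl_carriers / n_ctrl

    observed = statistic(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between genotype and status
        if statistic(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 3 rare variant sites in one gene, 40 cases and 40 controls
cases = [[1, 0, 0]] * 4 + [[0, 1, 0]] * 4 + [[0, 0, 1]] * 4 + [[0, 0, 0]] * 28
controls = [[1, 0, 0]] + [[0, 0, 1]] + [[0, 0, 0]] * 38
genotypes = cases + controls
is_case = [1] * 40 + [0] * 40

observed, p_value = collapsing_permutation_test(genotypes, is_case)
print(f"carrier frequency difference = {observed:.3f}, permutation p = {p_value:.4f}")
```

No single site in the toy data has more than five carriers, yet the collapsed carrier indicator shows a clear difference between cases and controls. A test like this treats every variant in the gene as equally relevant, so including non-causal variants dilutes the signal.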

Challenges for Statistical Methodologies

• Variant misclassification:
– Non-causal variants included:
• Huge number of mutations in the genome
– Most of them do not cause the disease under study
– Causal variants excluded:
• Intronic mutations
• Intergenic regions
• Unknown patterns of interactions:
1. Within-gene interactions: e.g. Hirschsprung’s disease (RET gene)
2. Gene x gene interactions: e.g. breast cancer genes (BRCA1, BRCA2 x CHEK2)

Adaptive methods are needed

Kernel Based Adaptive Clustering

• Combines variant classification with association testing in a coherent framework
• Applicable to population-based case/control studies using unrelated individuals
• Robust against variant misclassification
• Can handle gene x gene interactions and gene x environment interactions
