Statistical Genetics Using Sequence Data
Dajiang J. Liu
Department of Statistics
Why We Study Statistical Genetics
• Statistics originated from genetics
• R.A. Fisher: “The Correlation Between Relatives on the Supposition of Mendelian Inheritance”
  – Introduced the concept of variance in this article
• Francis Galton: Regression of human height toward the mean
  – Introduced correlation and regression
• Karl Pearson:
  – “Mendelism and the problem of mental defect”
  – “Tuberculosis, heredity and environment”
• Why don’t we seek our roots?
• In order to find disease genes in the genome, statistics is a must
Statistical Genetics
• Disease gene mapping:
  – The determination of the sequence of genes and their relative distances from one another on a specific chromosome
  – Technology-driven field:
    1. Mendel’s era: Segregation analysis
       - Patience: peas, fruit flies; inbreeding is necessary
Experimental Design
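Segregation analysis in its simplest form asks whether offspring counts match a Mendelian ratio. The sketch below is only an illustration of that idea, not a method taken from these slides: it runs a chi-square goodness-of-fit test against a 3:1 ratio. The counts are the commonly cited figures from Mendel's seed-shape cross (5474 round, 1850 wrinkled), and the function names are hypothetical.

```python
import math

def segregation_chi_square(observed, expected_ratio):
    """Pearson goodness-of-fit statistic for observed counts vs. a Mendelian ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    chi2 = 0.0
    for obs, ratio in zip(observed, expected_ratio):
        expected = total * ratio / ratio_sum
        chi2 += (obs - expected) ** 2 / expected
    return chi2

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Commonly cited counts from Mendel's seed-shape experiment (round, wrinkled).
counts = (5474, 1850)
chi2 = segregation_chi_square(counts, (3, 1))
p = p_value_1df(chi2)
print(f"chi-square = {chi2:.4f}, p = {p:.3f}")
```

A large p-value here means the counts are compatible with 3:1 segregation. With two phenotype classes the test has one degree of freedom, which is why the one-degree-of-freedom tail formula is sufficient.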
Statistical Genetics
• Modern era:
  – Microsatellite markers:
    • Genetic linkage analysis
      – Extremely successful for mapping and identifying Mendelian traits
  – Single nucleotide polymorphism (SNP) markers:
    • Case-control studies:
      – Genome-wide association studies (GWAS): to identify common variants involved in complex traits
Computational techniques for likelihood in pedigrees
Statistics plays a major role
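The case-control comparison behind a SNP association scan can be sketched for a single marker as a chi-square test on a 2x2 table of allele counts. This is a minimal sketch under stated assumptions: the allele counts are hypothetical, the function name is illustrative, and a real genome-wide scan would repeat the test at every marker and correct for multiple testing.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 df) and p-value for a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        raise ValueError("every row and column of the table must be non-empty")
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    chi2 = n * cross ** 2 / (row_case * row_ctrl * col_alt * col_ref)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 500 cases and 500 controls, two alleles per person.
chi2, p_value = allelic_chi_square(case_alt=240, case_ref=760,
                                   ctrl_alt=200, ctrl_ref=800)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

The allelic test treats each chromosome as an independent observation, which is reasonable only for unrelated individuals and a marker in Hardy-Weinberg equilibrium.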
Statistical Genetics
• Sequencing era:
• Study of diseases due to rare variants is emerging
ABI SOLiD sequencer
Statistics is ALL for sequencing data
Statistical Genetics
• Data we work with
Human Genome Project
HapMap Project
1000 Genomes Project
Multifactorial Disease Etiology Hypothesis
• Common Disease Common Variants (CD/CV) hypothesis:
  – Common diseases are caused by a few common variants with moderate effect
  – E.g. age-related macular degeneration
• Common variants are likely to have lower odds ratios than rare variants
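To make the odds ratio comparison concrete, the sketch below computes an odds ratio and an approximate 95% confidence interval (Woolf's log-based method) from carrier counts in cases and controls. All counts are invented for illustration and the function name is hypothetical; the example only shows how the quantity is calculated, not observed effect sizes.

```python
import math

def odds_ratio(case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers):
    """Odds ratio with an approximate 95% confidence interval (Woolf's method)."""
    cells = (case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for this approximation")
    estimate = (case_carriers * ctrl_noncarriers) / (case_noncarriers * ctrl_carriers)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper

# Hypothetical counts per 1000 cases and 1000 controls.
common = odds_ratio(300, 700, 250, 750)  # common variant, modest effect
rare = odds_ratio(20, 980, 5, 995)       # rare variant, larger effect

for label, (est, lo, hi) in (("common", common), ("rare", rare)):
    print(f"{label}: OR = {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With these made-up counts the rare variant has the larger odds ratio but also the wider interval, because few carriers are observed.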
Multifactorial Disease Etiology Hypothesis
• Common Disease Rare Variants hypothesis:
  – Common diseases are caused by multiple rare variants with large effect sizes
  – The discovery of rare variants will have a high impact on public health, since they will aid in risk prediction and treatment
• E.g. Multiple Rare Alleles Contribute to Low Plasma Levels of HDL Cholesterol
• E.g. Colorectal Adenomas
Challenges for Statistical Methodologies
• Variant misclassification:
  – Non-causal variants included:
    • Huge number of mutations in the genome:
      – Most of them do not cause the disease under study
  – Causal variants excluded:
    • Intronic mutations
    • Intergenic regions
• Unknown patterns of interactions:
  1. Within-gene interactions: e.g. Hirschsprung’s disease (RET gene)
  2. Gene x gene interactions: e.g. breast cancer genes (BRCA1, BRCA2 x CHEK2)
Adaptive methods are needed
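The cost of including non-causal variants can be illustrated with a small calculation. The sketch assumes a simple collapsing analysis (a person counts as a carrier if they have any rare variant in the gene), which is a stand-in chosen for illustration and not a method described in these slides. Variant sites are assumed independent, and all frequencies and function names are hypothetical.

```python
def carrier_probability(carrier_freqs):
    """Probability of carrying at least one variant, assuming independent sites."""
    p_none = 1.0
    for freq in carrier_freqs:
        p_none *= 1.0 - freq
    return 1.0 - p_none

def collapsed_odds_ratio(case_freqs, ctrl_freqs):
    """Odds ratio for 'carries any variant' in cases versus controls."""
    p_case = carrier_probability(case_freqs)
    p_ctrl = carrier_probability(ctrl_freqs)
    return (p_case / (1.0 - p_case)) / (p_ctrl / (1.0 - p_ctrl))

# Hypothetical carrier frequencies: three causal variants enriched in cases,
# plus a growing number of non-causal variants equally frequent in both groups.
causal_in_cases = [0.02] * 3
causal_in_controls = [0.005] * 3
neutral_freq = 0.01

results = []
for n_neutral in (0, 5, 20):
    neutral = [neutral_freq] * n_neutral
    value = collapsed_odds_ratio(causal_in_cases + neutral,
                                 causal_in_controls + neutral)
    results.append(value)
    print(f"{n_neutral:2d} non-causal variants included: OR = {value:.2f}")
```

Under these assumptions the collapsed odds ratio shrinks toward 1 as more non-causal variants are included, so the association signal from the causal variants is diluted.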
Kernel Based Adaptive Clustering
• Combines variant classification and association testing in a coherent framework
• Applicable to population-based case/control studies using unrelated individuals
• Robust against variant misclassification
• Can handle gene x gene interactions and gene x environment interactions