Week 3 - Thurs 27th Jan, 9.00am - 12.00
Mathematical models and statistical methods in molecular biology
Julia Brettschneider
In the last two decades, molecular biology research has made enormous progress. Mathematical models can play a crucial role in describing the complex structure of the new discoveries in this field. The acceleration of the scientific research has been enabled by new high-throughput molecular measurement technologies. The resulting new kinds of high-dimensional data sets require the development of appropriate statistical methods. [...]
This seminar will give an overview of central topics in modern molecular biology and of the role that mathematics and statistics can play in this discipline. It will cover many examples including the discovery of genes linked to certain phenotypes, multiple testing, gene networks, data quality and molecular diagnostics.
Outline
• Part I: Introduction to genomics
• Part II: Natural sciences & mathematical sciences
• Part III: Examples
Part I: Introduction to genomics
1. Basic molecular biology and genomics
2. Applications in biology and medicine
3. Introduction to measurement technologies
Now you are here
Procaryotes
Eucaryotes
Similarities between eucaryotes and procaryotes: cell membrane, DNA
Main differences: eucaryotes have true nuclei, hierarchically structured chromosomes and membrane-bound organelles
Proteins are built according to genetic information in a multi-step process.
Central Dogma of Molecular Biology
Cell
DNA
RNA
Striated muscle
Cells: same genes,
different looks
..., stone walls, roof, divided into rooms, glass windows, wooden frames, hardwood doors, ...
Plan
Textual description
Metaphor: Architecture
Product
Design team
Construction
Plan(different)
Textual description
(same)
Process and product are not deterministic
Product(different)
Design team
Construction
..., stone walls, roof, divided into rooms, glass windows, wooden frames, hardwood doors, ...
Plan
Textual description
Central Dogma of Molecular Biology
Product
Design
Construction
RNA
DNA
Cell
The rate of a gene’s products is variable. It depends on:
Can you give specific examples?
What?
Gene expression
Examples?
The rate of a gene’s products is variable. It depends on:
- type and state of the cell
- environmental conditions
- developmental stage
- ...
Examples:
- reaction to heat shock (in yeast)
- yeast cell cycle
- liver tissue vs. brain tissue
- tumor tissue vs. healthy tissue
- brain tissues from schizophrenic person vs. control
- blood with and w/o drug influence
- embryo development
- young vs. old
Gene expression
Examples?
Gene expression
The rate of a gene’s products is variable. It depends on:
- type and state of the cell
- environmental conditions
- developmental stage
- ...
This rate is called the gene expression level. It has two components:
• Transcript (RNA) level
• Protein level
2. Applications
• Functional genetics and genomics
• Biological research
• Complex genetic diseases
Functional genetics
Traditional research approach: “Is gene A involved in biological process xyz?”
• Measure the expression of a gene in different kinds of tissue (e.g., tumor and control)
• Infer whether the gene contributes to the phenotypic differences (e.g., plays a role in tumor growth).
This approach needs candidate genes.
Functional genomics
Exploratory research approach: “Which genes are involved in biological process xyz?”
• Understanding cellular processes: e.g., which genes are involved in the yeast cell cycle?
• Etiology of complex genetic diseases: e.g., which genes play a role in schizophrenia?
• Effects of environmental factors (e.g., light, temperature)
This approach needs new technology.
Biology
Examples of typical areas for gene expression studies:
• Developmental processes
• Circadian clock
• Cell cycle
• Reaction to stimulus (e.g. temperature change, chemical)
Complex genetic diseases
• Cancers, brain disorders, cardio-vascular diseases, asthma, some skin diseases, some birth defects, etc. have a genetic component, among other factors
• Detect differentially expressed (patients vs controls) genes and discover pathways
• Test reaction to treatments on molecular level
• Refine disease subtypes
• Tailor treatment type to individual
• Develop diagnostic and prognostic tests based on individual molecular information
Southern blot analysis of independent Ura- lacZ- mutants isolated from wild-type and mft1D cells. DNA from yeast transformants carrying mutant pCM184-LAUR plasmids was isolated from overnight cultures in SC and used for Southern analysis by digestion with HindIII, electrophoresis, and hybridization with radiolabeled pCM184-LAUR, or for sequence analysis of different fragments of lacZ and URA3 amplified by PCR.
Microarrays
• First high-throughput technology for gene expression measurement (1990s)
• Opened up new avenues of research: functional genomics
• Assay tens of thousands of genes simultaneously in one experiment
• By now: a number of different platforms made by companies such as Agilent and Affymetrix
• Quality has been improving, prices vary accordingly
Nucleic acid hybridization: here DNA-RNA
Before Hybridization
Array 1 Array 2
Sample 1 (labelled) Sample 2 (labelled)
After Hybridization
Array 1 Array 2
4 2 0 3 0 4 0 3
Microarray
[Figure: GeneChip probe array (1.28 cm) carrying >200,000 different complementary probes. Each hybridized probe cell (24 µm) holds millions of copies of a specific oligonucleotide probe synthesized in situ (“grown”). A single-stranded, labeled RNA target binds to the oligonucleotide probe. Also shown: image of a hybridized probe array.]
Compliments of D. Gerhold
Part II: Natural sciences & mathematical sciences
David Hilbert, 8 September 1930 in Königsberg/Kaliningrad, Congress of the Association of German Natural Scientists and Medical Doctors, speech “Naturerkennen und Logik”:
The instrument that mediates between theory and practice, between thought and observation, is mathematics; it builds the bridge and makes it stronger and stronger...
Euler proved that it was not possible to plan a route that would cross each of the seven bridges of Königsberg exactly once, whether or not you ended up in the same place as you began.
David Hilbert (continued):
Thus it happens that our entire present day culture, to the degree that it reflects intellectual achievement and the harnessing of nature, is founded on mathematics. GALILEO said long ago: Only he can understand nature who has learned the language and signs by which it speaks to us; this language is mathematics and its signs are mathematical figures. KANT declared, “I maintain that in each natural science there is only as much true science as there is mathematics.” In fact, we don’t master a theory in natural science until we have extracted its mathematical kernel and laid it completely bare. [...] We must know. We will know.
5. Repeat steps 3 and 4 to build a theory, i.e. a framework for explaining observations and making predictions
Source: http://physics.ucr.edu/~wudka/Physics7/Notes_www/node6.html
Establishing knowledge
Epistemology | Stat test (concept) | Stats (math technique)
Theory | A hypothesis | Stochastic model
Evidence | Measurements on a sample | Data
Judgement | Decision based on likelihood of hypothesis given measurements | Comparison of probabilities for observing the data in model
Discussion | Quality of decision | Error rates
Establishing knowledge
Epistemology | Stat test (concept) | Math technique
Theory | A hypothesis | H0 (and Ha)
Evidence | Measurements on a sample | Test statistics (function of data)
Judgement | Decision based on likelihood of hypothesis given measurements | Probabilities for observing the data under H0, Ha
Discussion | Quality of decision | Error probabilities
Hypothesis testing paradigm
Epistemology | Stats (concept) | Math technique
Theory | A hypothesis | H0 (and Ha)
Evidence | Measurements on a sample | Test statistics (function of data)
Judgement | Decision based on likelihood of hypothesis given measurements | Probabilities for observing the data under H0, Ha
Discussion | Quality of decision | Error probabilities
Example: z-test
Distributions of test statistics (function of data) under H0 and Ha. Decision rule based on critical value.
α = probability that I reject H0 though it is true (false positive)
β = probability that I do not reject H0 though Ha is true (false negative)
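The two error probabilities of the z-test can be computed directly once the distributions under H0 and Ha are fixed. The following is a minimal sketch, assuming a one-sided test of a normal mean with known standard deviation; the function name and the numbers in the usage lines are illustrative and not taken from the slides.

```python
from statistics import NormalDist


def z_test_error_rates(mu0, mu_a, sigma, n, alpha=0.05):
    """One-sided z-test of H0: mu = mu0 against Ha: mu = mu_a (mu_a > mu0).

    Returns the critical value for the z statistic and the false negative
    probability beta, given the chosen false positive probability alpha.
    """
    std_normal = NormalDist()
    # Reject H0 when the z statistic exceeds this critical value.
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Under Ha the z statistic is normal with this mean and variance 1.
    shift = (mu_a - mu0) / (sigma / n ** 0.5)
    # beta = P(z statistic <= critical value | Ha is true)
    beta = std_normal.cdf(critical_z - shift)
    return critical_z, beta


# Hypothetical setting: mean 0 under H0, mean 1 under Ha, sigma 2, n = 16.
critical_z, beta = z_test_error_rates(mu0=0.0, mu_a=1.0, sigma=2.0, n=16)
```

Moving the critical value trades one error for the other: a smaller alpha pushes the critical value up and increases beta for the same sample size.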
Massively parallel genomic measurement technologies
Technologies that can assay large numbers of macromolecules simultaneously in one experiment
Examples:
• DNA: sequencing
• RNA: qtPCR, microarrays (see below), mRNA-Seq
• Proteins: mass spectrometry, protein arrays
Sandrine Dudoit Multiple Testing Procedures with Applications to Genomics MCP 2007
Multiple Hypothesis Testing Problems in Genomics
[Figure: DNA → RNA → Protein, with the steps Transcription and Translation. Labels: Sequencing, Assembly, Alignment, Phylogeny, Annotation: Exon/intron, ..., Polymorphisms, Pathways, Structure prediction, Transcript levels, Alignment, Phylogeny, Structure prediction, Pathways.]
Phenotypes: Biological and clinical outcomes
Biological annotation metadata
Covariates: Environment, treatment
Genotypes
Figure 1: Biomedical and genomic data.
July 6, 2007 © Copyright 2007, all rights reserved Page 7
New generation of technologies opens up new avenues of scientific research
For example:
• Understanding routine cellular processes, e.g. which genes are involved in the yeast cell cycle?
• Etiology of complex genetic diseases, e.g. which genes play a role in schizophrenia?
• Molecular diagnosis, e.g. tumor classification by gene expression signatures
Establishing knowledge
Epistemology | Stat test (concept) | Stats (math technique)
Theory | A hypothesis | Stochastic model
Evidence | Measurements on a sample | Data
Judgement | Decision based on likelihood of hypothesis given measurements | Comparison of probabilities for observing the data in model
Discussion | Quality of decision | Error rates
All possible hypotheses
Advantage: Unprejudiced
Objection: Hypotheses generated mechanically (lack of scientific prioritisation, bury truth in rubbish)
Hypothesis testing paradigm
Epistemology | Stats (concept) | Math technique
Theory | A hypothesis | H0 (and Ha)
Evidence | Measurements on a sample | Test statistics
Judgement | Decision based on likelihood of hypothesis given measurements | Probabilities for observing the data under H0, Ha
Discussion | Quality of decision | Error probabilities
many of these
more theory needed here (see Ex. 1)
Part III: Examples
• Virus sequence variation & hypothesis testing
• Gene expression measurement technologies & data preprocessing and quality
Virus example: DNA sequence variation and HIV replication capacity (from Dudoit et al, 2007)
• Phenotype: Severity of disease as reflected by the virus's replication capacity (RC) (see fig)
• Genotype (DNA): Codons/amino acids in protease (PR) and reverse transcriptase (RT) regions
Question: Is replication capacity related to sequence variation?
[Figure: HIV life cycle]
Sandrine Dudoit Multiple Hypothesis Testing, Part III PB HLTH 240D – Spring 2007
HIV-1 Sequence Variation and Viral Replication Capacity
Data. The HIV-1 dataset comprises n = 317 records, linking viral
replication capacity with PR and RT sequence data, from
individuals participating in studies at the San Francisco General
Hospital and the Gladstone Institute of Virology and Immunology.
The data for each of the n = 317 patients consist of the following.
• Y , a continuous replication capacity outcome/phenotype.
• X = (X(m) : m = 1, . . . ,M), an
M = 96 + 186 = 282-dimensional covariate vector of binary
codon genotypes in the PR (pr4–pr99) and RT (rt38–rt223)
HIV-1 regions. Codons are recoded as binary covariates, with
value of 0 corresponding to the wild-type codon, i.e., the most
common codon among the n = 317 patients, and value of 1 for
mutant codons, i.e., all other codons.
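The binary recoding described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the example codons are hypothetical, and ties for the most common codon are broken by first occurrence, which the source does not specify.

```python
from collections import Counter


def recode_binary(codons):
    """Recode the codons observed at one position across patients.

    0 = wild-type codon (the most common codon among the patients),
    1 = mutant codon (any other codon).
    Ties are broken by first occurrence; this tie rule is an assumption.
    """
    wild_type, _count = Counter(codons).most_common(1)[0]
    return [0 if codon == wild_type else 1 for codon in codons]


# Hypothetical codons at a single position for six patients.
position_codons = ["ATG", "ATG", "ATA", "ATG", "GTG", "ATG"]
binary = recode_binary(position_codons)
```

Applying the function to each of the M = 282 positions in turn would give one binary covariate vector per patient.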
March 19, 2007 Page 65
Data (from Segal et al, 2004)
Naive approach
• For each of the 282 codons m, test whether RC Y is associated with X(m).
• For each of the 282 codons m, get rejection or non-rejection, along with error probabilities (significance levels).
• Problem: For each of the 282 codons, you make a wrong decision with a small probability, and these errors accumulate.
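How quickly the small per-test error probabilities accumulate can be made concrete with a short calculation. The sketch below assumes that all 282 null hypotheses are true and, for the first function only, that the tests are independent; neither assumption comes from the slides, and the significance level 0.05 is an illustrative choice.

```python
def prob_at_least_one_false_positive(alpha, m):
    """P(at least one false rejection) for m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(alpha, m):
    """Expected number of false rejections; no independence needed."""
    return alpha * m


M = 282  # number of codon hypotheses in the HIV example
fwer_naive = prob_at_least_one_false_positive(0.05, M)
expected = expected_false_positives(0.05, M)
```

Under these assumptions roughly 14 of the 282 tests would be expected to reject by chance alone, and at least one false positive is practically certain.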
Excursion: Multiple testing
A historical source
Antoine Augustin Cournot (1801-1877, French philosopher, economist, mathematical game theorist) wrote in 1843:
“... it is clear that nothing limits ... the number of features according to which one can distribute [natural events or social facts] into several groups or distinct categories.”
His example: investigating the chance of male birth.
“One could distinguish first of all legitimate births from those occurring out of wedlock,... one can also classify births according to birth order, according to the age, profession, wealth, or religion of the parents... .”
A historical source (cont.)
“Usually, these attempts through which the experimenter passed don’t leave any traces; the public will only know the result that has been found worth pointing out;
... and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or cannot be attributed to chance.”
Generation of multiple hypotheses via subpopulations.
Bonferroni method: Reject any hypothesis with (unadjusted) p-value less than or equal to α/M.
Adjusted p-values: $\tilde{p}_j = \min(M p_j, 1)$
Controls FWER (proof is straightforward).
Conservative adjustment. More sophisticated methods: Holm (1979) uses the order of the raw p-values; Westfall & Young’s (1993) step-up and step-down methods use the order and joint distribution of the raw p-values.
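The Bonferroni adjustment and Holm's step-down refinement can be written out as follows. This is a minimal sketch with hypothetical raw p-values; the function names are illustrative. The FWER control of Bonferroni follows from the union bound: the probability that any true null has p-value at most α/M is at most M times α/M.

```python
def bonferroni_adjust(pvalues):
    """Bonferroni adjusted p-values: min(M * p_j, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def holm_adjust(pvalues):
    """Holm (1979) step-down adjusted p-values.

    The k-th smallest raw p-value (k = 0, 1, ...) is multiplied by M - k,
    and a running maximum keeps the adjusted values monotone in the
    order of the raw p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):
        candidate = min((m - rank) * pvalues[j], 1.0)
        running_max = max(running_max, candidate)
        adjusted[j] = running_max
    return adjusted


# Hypothetical raw p-values for four hypotheses.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni_adjust(raw)
holm = holm_adjust(raw)
```

A hypothesis is rejected at level α when its adjusted p-value is at most α. Holm's adjusted p-values never exceed the Bonferroni ones, which is why Holm is less conservative while still controlling the FWER.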
Multiple hypotheses in genomics?
• Very common
• Number of hypotheses much bigger than in traditional multiple testing examples
• Ex. virus RC: 282 hypotheses
• Ex. genome-wide study to find disease related genes: tens of thousands of hypotheses
• Traditional adjustment methods not useful
• Traditional concepts of error rates too narrow
Massive multiple testing: new research branch
• Grew quickly, mostly motivated by genomics (but also fMRI, astronomical images)
• Introduction of a variety of concepts for joint error probabilities
• Methods to compute distributions under Null
• New adjustment methods, new concepts
• e.g. Dudoit & van der Laan 2002+
Sandrine Dudoit Multiple Testing Procedures with Applications to Genomics MCP 2007
Shewhart (1927) explains the standpoint of the applied scientist:
''He knows that if he were to act upon the meagre evidence sometimes available to the pure scientist, he would make the same mistakes as the pure scientist makes in estimates of accuracy and precisions. He also knows that through his mistakes someone may lose a lot of money or suffer physical injury or both. [...] He does not consider his job simply that of doing the best he can with the available data; it is his job to get enough data before making this estimate.''
Microarray technology has migrated from basic sciences to medical research. This implies changed data quality needs.
Classical QA/QC
• Original motivation: Mass production of manufactured items
• Shewhart: Pioneer of statistical approach to QA/QC (1920s, 1930s)
$\mathrm{NUSE} := \dfrac{1/\sqrt{W_k}}{\mathrm{med}_{k'}\left(1/\sqrt{W_{k'}}\right)}$
Note: Normalisation because of heterogeneity in the number of effective probes
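One reading of the NUSE definition is sketched below. It assumes that W_k denotes the total probe weight of a given probe set on array k, so that 1/sqrt(W_k) is the unscaled standard error, and that the median is taken over all arrays; this interpretation, the function name and the example weights are assumptions rather than something the slide states.

```python
from statistics import median


def nuse(total_weights):
    """Normalised unscaled standard errors of one probe set across arrays.

    total_weights[k] is assumed to be the total probe weight W_k of the
    probe set on array k. The unscaled standard error 1/sqrt(W_k) is
    divided by its median over all arrays.
    """
    unscaled_se = [1.0 / w ** 0.5 for w in total_weights]
    centre = median(unscaled_se)
    return [se / centre for se in unscaled_se]


# Hypothetical total weights of one probe set on three arrays.
weights_one_probe_set = [16.0, 16.0, 4.0]
values = nuse(weights_one_probe_set)
```

In this toy example the third array carries a quarter of the weight, so its NUSE value is twice the typical one.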
2. Relative Log Expression (RLE):
Median chip: median expression over all arrays (gene by gene)
RLE (gene A) in array k = log ratio of gene A's expression in array k to gene A's median expression
Idea: use RLE distribution for quality assessment (QA).
Interpretation
Biological assumptions
(A) majority of genes similar between different samples
(B) # upregulated genes = # downregulated genes
Then, good quality is indicated by:
• Med(NUSE) = 1
• small IQR(NUSE)
• Med(RLE) = 0
• small IQR(RLE)
Note: interpretation of the distributions rather than individual probe or probe set quantities
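The RLE computation and the per-array summaries can be sketched as follows. The example assumes expression values already on the log scale, so that the log ratio becomes a difference; the function names and the small data matrix are hypothetical, and the quartile convention is the default of Python's `statistics.quantiles`.

```python
from statistics import median, quantiles


def rle_by_array(log_expr):
    """Relative log expression values, grouped by array.

    log_expr[g][k] is the log-scale expression of gene g on array k.
    For each gene, its median over all arrays (the median chip) is
    subtracted from its value on every array.
    """
    n_arrays = len(log_expr[0])
    rle = [[] for _ in range(n_arrays)]
    for gene_row in log_expr:
        median_chip = median(gene_row)
        for k, value in enumerate(gene_row):
            rle[k].append(value - median_chip)
    return rle


def median_and_iqr(values):
    """Median and interquartile range of one array's distribution."""
    q1, _q2, q3 = quantiles(values, n=4)
    return median(values), q3 - q1


# Hypothetical log expression values: 4 genes on 3 arrays.
log_expr = [
    [8.0, 8.1, 9.0],
    [5.0, 5.0, 6.2],
    [7.0, 6.9, 8.1],
    [10.0, 10.1, 11.0],
]
summaries = [median_and_iqr(array_rle) for array_rle in rle_by_array(log_expr)]
```

In this toy data the third array has a median RLE near 1 instead of 0, the kind of bias that would flag it for closer inspection.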
[Figure: MLL data, weights and NUSE]
The MAS 5.0 quality report
• Background: No range. Key is consistency.
• Raw Q (Noise): Between 1.5 and 3.0 is ok.
• Percent present calls: Typical range is 20-50%.
• Scaling factor: Should be kept below 10. Key is consistency across arrays being analyzed.
• 3’/5’ ratios for GAPDH, BetaActin: less than 3.
• At 1.5 pM bioB should be called Present 70% of the time.
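The numeric guidelines of the quality report lend themselves to a simple range check. The sketch below is illustrative only: the function and argument names are hypothetical, it covers just the four items with explicit thresholds, and it ignores the consistency-across-arrays criteria, which need more than one array.

```python
def mas5_report_flags(raw_q, percent_present, scaling_factor, ratio_3_5):
    """Return the names of quality report measures outside the quoted ranges.

    Thresholds follow the guideline values listed above:
    Raw Q between 1.5 and 3.0, percent present calls between 20 and 50,
    scaling factor below 10, and 3'/5' ratio less than 3.
    """
    flags = []
    if not 1.5 <= raw_q <= 3.0:
        flags.append("raw_q")
    if not 20.0 <= percent_present <= 50.0:
        flags.append("percent_present")
    if not scaling_factor < 10.0:
        flags.append("scaling_factor")
    if not ratio_3_5 < 3.0:
        flags.append("ratio_3_5")
    return flags


# Hypothetical report values for two arrays.
ok_flags = mas5_report_flags(2.0, 35.0, 1.2, 1.1)
bad_flags = mas5_report_flags(4.0, 10.0, 12.0, 3.5)
```

An empty list only means the array passes these per-array ranges; it does not by itself establish good quality.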
MLL 1: Median NUSE vs Affy quality report measures
Red pointer indicates the low quality chip. Affy quality report scores all in normal range.
[Figure panels: medNUSE, %P, Noise, Scale factor, 3’/5’]
Is that really a bad array?
Confirmation: bias and spread in RLE
Case studies
Pritzker gender study data
• Tissue: Human brain regions dorsolateral prefrontal cortex (DLPFC) and cerebellum (CB)
• Hybridizations: 14 samples, hybridized in two labs (Irvine and Michigan), some data missing.
• Platform: Affymetrix HU95A.
Cerebellum
orange = Irvine, violet = Michigan
Lab difference in quality
Reason: insufficiently calibrated scanners.
Pritzker mood disorder data
• Tissue: Human brain regions from bipolar, major depression and control patients.
• Hybridizations: each sample hybridized in two out of three labs (Irvine, Davis and Michigan).
• Platform: Affymetrix HU133A,B.
• Done after gender study, after enormous effort to unify the conditions in the different labs.
Observation: Improved, but still noticeable differences between labs.
Note: the boxplots now show the distribution of chip level quality scores (e.g. IQR(NUSE)) of all the chips in an experiment.
For about 100 years Drosophila has been used as a model organism in genetic analyses. In fact much of our current knowledge on genes, development and genetic interactions originates from work with this system. One reason to choose Drosophila is its short life cycle.
Molecular diagnosis
Decisions under uncertainty, risk interpretation/communication
Some of the tasks | Some of the methods
David Hilbert:
The instrument that mediates between theory and practice, between thought and observation, is mathematics; it builds the bridge and makes it stronger and stronger.
Physics a century ago - biology today
Statistical mechanics | Statistical genomics
Probabilistic models for description of a particle system on the microscopic level (including interactions) | Models for quantitative description of the interactions between molecules (genes & their products)
Thermodynamic limit: fluctuations become negligible, obtain macroscopic parameters | Some kind of data aggregation: explains phenotype of a cell, e.g. type, stage etc.