sequencing for optimal (microRNA) biomarker discovery and validation in cardio- metabolic (and renal) disease Christos Argyropoulos MD, PhD, FASN Department of Internal Medicine Division of Nephrology University of New Mexico Health Sciences Center
Apr 06, 2017
Correcting bias and variation in small RNA sequencing for optimal (microRNA) biomarker discovery and validation in cardio-metabolic (and renal) disease
Christos Argyropoulos MD, PhD, FASN
Department of Internal Medicine
Division of Nephrology
University of New Mexico Health Sciences Center
Overview
• Models of sequence counts in short RNA-seq experiments
• Estimating and controlling for bias in small RNA-seq experiments
• Statistical approaches to analyzing differential expression
• MicroRNA regulation – a control theory perspective
• MicroRNAs as biomarkers in diabetes, renal and cardiometabolic disease
• Leveraging our approach for optimal biomarker discovery
Signals in short RNA-seq data
Building a model from first principles
Background
• Short RNA-seq data are becoming more and more abundant
• There is poor reproducibility of findings between and within research groups
• Systematic measurement bias confounds findings
  • Systematic variation is relatively stable within protocols
  • Systematic variation is unpredictable between different protocols and platforms
• Statistical methods may be used to explore and address such biases
• Existing approaches are phenomenological descriptions
  • what do model parameters stand for?
  • how can one best use these models?
Building a model from first principles
• Establish testable predictions that may be verified in existing datasets
• Establish correspondence between model parameters and experimental steps
• Use this model to understand and correct systematic and random bias in short RNA-seq
• Embed the model into more general frameworks for applications:
  • Epidemiological
  • Biomarker discovery and validation
  • Medical diagnostics
The short RNA-seq experiment
The vendor’s view
The biochemist’s view
https://doi.org/10.1093/nar/gkt1021
http://www.genomics.hk/SamllRna.htm
http://www.geospiza.com/Products/SmallRNA.shtml
Conceptual model of the short RNA-seq experiment (this is what we will talk about)
• X1, X2, …, Xn: abundance in original preparation
  ↓ ligation efficiency fi
• Λ1, Λ2, …, Λn: abundance in adapted (ligated) sample
  ↓ number of PCR cycles N, PCR efficiency qi
• B1, B2, …, Bn: abundance in PCR amplified library
  ↓ library dilution factor d, number of probes K, probability of capture si
• Y1, Y2, …, Yn: abundance in capture probes
  ↓ probability of signal generation r, probability of sequence generation pi
• Abundance of counts in fastq files
Modeling the qPCR amplification reaction
• Statistics of PCR amplification
  • Branching (Galton-Watson) process
  • GW distribution only available implicitly, i.e. through simulation
• Large scale simulations to derive an approximation to the GW process
  • PCR literature, GW theory, martingale arguments → candidate distributions
  • Information theory arguments used to compute the distance between GW samples and the approximate distributions
  • A (truncated) Normal distribution derived at the end
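The branching-process view of PCR can be sketched in a few lines. This is only an illustration, assuming that every molecule is copied with the same probability q in each cycle; the function name and the parameter values (x0, n_cycles, q) are hypothetical and not taken from the talk. The simulated mean is compared with the closed-form expectation x0(1+q)^N of a Galton-Watson process.

```python
import random
import statistics


def simulate_pcr(x0, n_cycles, q, rng):
    """One Galton-Watson trajectory: each molecule is copied with probability q per cycle."""
    molecules = x0
    for _ in range(n_cycles):
        # Every molecule present independently succeeds (is duplicated) or not.
        copies = sum(1 for _ in range(molecules) if rng.random() < q)
        molecules += copies
    return molecules


rng = random.Random(1)
x0, n_cycles, q = 5, 8, 0.8  # illustrative values only

draws = [simulate_pcr(x0, n_cycles, q, rng) for _ in range(1000)]
expected_mean = x0 * (1 + q) ** n_cycles  # E[molecules] for a GW process
sim_mean = statistics.mean(draws)
sim_sd = statistics.stdev(draws)

print(f"expected mean {expected_mean:.1f}, simulated mean {sim_mean:.1f}, sd {sim_sd:.1f}")
```

A histogram of `draws` is one way to eyeball how close the amplified abundance is to a (truncated) Normal shape for a given starting copy number.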
Flattening the hierarchy through marginalization
Integrate sources of variation out of the model:
1. library sequence depth variation
2. PCR amplification
Final statistical model is about absolute counts
• Direct modeling ≠ % of counts
• Limit of approximation encompasses all possible sample compositions
• The result is a truncated Normal Poisson mixture distribution (approximated via a Negative Binomial or Linear Quadratic Gaussian family)
Model implements a Linear-Quadratic (LQ) mean-variance relationship
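A small simulation can make the linear-quadratic relationship concrete. The sketch below assumes a Poisson count whose rate is drawn from a Normal distribution truncated at zero with a fixed coefficient of variation (cv); under that assumption the variance is approximately mean + (cv × mean)². The sampler, the cv value and the chosen means are illustrative and are not the parameterization used in the talk.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def mixture_count(mu, cv, rng):
    """Poisson count whose rate is Normal(mu, cv*mu) truncated to positive values."""
    lam = -1.0
    while lam <= 0:
        lam = rng.gauss(mu, cv * mu)
    return poisson(lam, rng)


rng = random.Random(2)
cv = 0.3  # hypothetical coefficient of variation of the mixing distribution

results = {}
for mu in (5, 20, 50):
    draws = [mixture_count(mu, cv, rng) for _ in range(5000)]
    predicted_var = mu + (cv * mu) ** 2  # linear term + quadratic term
    results[mu] = (statistics.mean(draws), statistics.variance(draws), predicted_var)

for mu, (m, v, pv) in results.items():
    print(f"mu={mu}: mean={m:.2f}, variance={v:.2f}, LQ prediction={pv:.2f}")
```

The quadratic term dominates at high abundance, which is why a pure Poisson (linear) variance model would understate uncertainty for abundant sequences.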
Distributional Regression for RNA-seq data
LQ relationship between mean (μ) and variance (σ²)
• The variance and the mean have to be modelled concurrently
• Unless the variance is modelled → inconsistent statistics → small (overoptimistic) p values
• Realm of distributional regression models (GAMLSS – Generalized Additive Models for Location, Scale and Shape)
• One can re-use existing software frameworks to fit such models
Validating model(s) with synthetic mixes of known composition
• Allow one to test the “backbone” of the model without worrying about the adequacy of the modeling of biology
• Sequencing of equimolar mixes:
  • Explore and model systematic bias in the same protocol
• Sequencing of dilution series or non-equimolar mixes:
  • “Dose-response” curve of the bias
  • Examination of “debiasing” approaches for the ability to uncover the truth
• Model may also be used to analyze the performance of differential expression algorithms
Testable predictions: mean and variance linear quadratic relationships in public RNA-seq data
Linear Quadratic Relationship in the legacy datasets of the Galas group
Estimating and Correcting for Ligase Bias
At the corner of Biochemistry and Mathematics
Enzymatic mechanism of RNA ligation
• The kinetics of RNA ligation were investigated thoroughly in the 1970s and early 1980s
• The intermolecular reaction is relevant to RNA-seq
• The mechanism involves three fully reversible steps that obey ping-pong ordered kinetics and are subject to substrate inhibition
Bias in RNA ligation was noted in these early investigations, and the enzyme was never used as a tool in synthetic chemistry, as solid phase methods took off in the 1980s
Kinetic analysis of ligase reaction velocity in RNA-seq protocols
• Existing protocols include abundant cofactors (in sharp contrast to the experiments of the 1970s)
  • Drive the reaction to the right
  • Rate-limiting single-step reaction instead of a three-step one
  • Substrate preference (bias in reaction yields) is not eliminated
• Multi-substrate inhibition from all biosample sequences available for ligation
Analytical series approximation for ratios of random variables
• Ligase operates at the 1st order domain of Michaelis-Menten kinetics
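\[
V_i = \frac{V_i^{\max} X_i}{K_{M_i}\left(1+\sum_{j}\frac{X_j}{K_{M_j}}\right)}
\approx \frac{V_i^{\max} X_i}{K_{M_i}\left(1+n\,\frac{E[X]}{E[K_M]}\right)}
= \frac{V_i^{\max} X_i}{K_{M_i}\,(\cdots)}
\]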
Testable model predictions about ligase bias in RNA-seq experiments
Mathematical expression
Implications for ligase bias
• Concentration independence
• Sample composition independence
• Transferable within experiments done with the same protocol
• Protocol dependent (reaction velocity incorporates concentration of cofactors and enzyme)
• Sequence equimolar mixes to derive empirical correction factors for ligase bias
• Apply those to biological samples (“offsets” in distributional regression) to eliminate bias
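The calibrate-then-correct idea in the last two bullets can be sketched as follows. This is a noise-free toy, assuming that the bias acts multiplicatively on each sequence and is the same in the calibration run and in the biological sample; all names and numbers (`true_bias`, `sample_abundance`) are hypothetical. The learned log bias factors play the role of the "offsets" mentioned above.

```python
import math


def log_bias_factors(equimolar_counts):
    """Log bias of each sequence relative to the geometric mean of an equimolar run."""
    logs = [math.log(c) for c in equimolar_counts]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]


def apply_offsets(counts, log_bias):
    """Divide each count by its bias factor (i.e. subtract the offset on the log scale)."""
    return [c / math.exp(b) for c, b in zip(counts, log_bias)]


# Hypothetical sequence-specific ligation bias (unknown in a real experiment).
true_bias = [4.0, 1.0, 0.25, 2.0]

# Calibration run: every sequence is present at the same input amount.
equimolar_counts = [1000 * b for b in true_bias]

# Biological sample with unequal true abundances, distorted by the same bias.
sample_abundance = [10, 200, 50, 5]
sample_counts = [a * b for a, b in zip(sample_abundance, true_bias)]

log_bias = log_bias_factors(equimolar_counts)
corrected = apply_offsets(sample_counts, log_bias)

# After correction the counts are proportional to the true abundances.
ratios = [c / a for c, a in zip(corrected, sample_abundance)]
print([round(r, 4) for r in ratios])
```

In a regression setting the same factors would enter as fixed offsets in the linear predictor rather than being divided out of the counts.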
There is substantial variation in raw sequence counts from equimolar mixes
Application of bias factors virtually eliminates ligase bias
Monte Carlo Cross Validation in 3 equimolar datasets: randomly split the dataset into learning and testing subsets, learn the correction factors on the learning subset and apply them to correct the estimates of the testing subset. Repeat N times
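The cross-validation loop can be sketched on synthetic data. The example assumes log-scale counts from equimolar runs in which a fixed, sequence-specific bias is perturbed by independent noise; the run counts, noise level and split size are arbitrary illustrative choices, not those of the validation experiments.

```python
import math
import random
import statistics

rng = random.Random(3)
n_seq, n_runs = 50, 12

# Hypothetical log bias per sequence, and noisy equimolar runs on the log scale
# (the true log abundance is the same for every sequence, taken as 0 here).
log_bias = [rng.gauss(0.0, 1.0) for _ in range(n_seq)]
runs = [[b + rng.gauss(0.0, 0.2) for b in log_bias] for _ in range(n_runs)]


def rmse_around_mean(values):
    """Spread of log counts around their mean; zero would mean no residual bias."""
    center = sum(values) / len(values)
    return math.sqrt(sum((v - center) ** 2 for v in values) / len(values))


def one_split(all_runs, n_learn, rng):
    """Learn factors on a random subset of runs, evaluate on the remaining runs."""
    idx = list(range(len(all_runs)))
    rng.shuffle(idx)
    learn = [all_runs[i] for i in idx[:n_learn]]
    test = [all_runs[i] for i in idx[n_learn:]]
    n = len(all_runs[0])
    factors = [sum(run[i] for run in learn) / len(learn) for i in range(n)]
    raw = statistics.mean(rmse_around_mean(run) for run in test)
    corrected = statistics.mean(
        rmse_around_mean([v - f for v, f in zip(run, factors)]) for run in test
    )
    return raw, corrected


splits = [one_split(runs, n_learn=8, rng=rng) for _ in range(200)]
mean_raw = statistics.mean(raw for raw, _ in splits)
mean_corrected = statistics.mean(corrected for _, corrected in splits)
print(f"RMSE before {mean_raw:.3f}, after {mean_corrected:.3f}")
```

Because the correction factors are always evaluated on runs that were not used to learn them, the reduction in RMSE is an out-of-sample estimate.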
Empirical factors nearly eliminate bias between equimolar datasets with 10x different input (Galas Lab legacy datasets)
Bias factors in public non-equimolar short RNA-seq datasets
Design of Validation Experiments
What has been established?
• Moderate concentration independence
• Ability to nearly eliminate bias over at least two orders of magnitude
• Legacy platforms/experiments
What needs to be proven?
• Concentration independence over >2 orders of magnitude
• Sample composition independence
• Recovery of differential expression measures
• Any value relative to existing approaches?
Validation Experiments
Collaboration between PNRI (Galas Lab) and UNM (DoIM)
The largest single-protocol technical series to date (GSE93399)
Experimental Group | Dilution | N
miRExplore (972 short RNAs) | 1:10 | 10
286 miRNAs | 1:1 | 8
286 miRNAs | 1:10 | 8
286 miRNAs | 1:100 | 8
286 miRNAs | 1:1000 | 8
Ratio Metric Series A (descending) | Mix of 286 subpool A (1:1), 286 subpool B (1:10), 286 subpool C (1:100), 286 subpool D (1:1000) | 8
Ratio Metric Series B (ascending) | Mix of 286 subpool A (1:1000), 286 subpool B (1:100), 286 subpool C (1:10), 286 subpool D (1:1) | 8
Total | 7 groups | 58 sequenced x 2 = 116
Empirical bias correction over 3 orders of magnitude in equimolar datasets
RMSE reduction: 77%-90% (input in calibration run differs by up to x10 from target), 54%-67% otherwise
Empirical factors reduce bias by nearly 60% in non-equimolar series
Bias correction recovers expression profile patterns
Bias Correction in Heterogeneous Samples
• Correction factors remove ~55% of bias between equimolar samples
• ~ 70% of RNAs have expression within two fold from the mean (from 23%)
• Bias reduction is ~40% in ratiometric series
• ~63% of RNAs have expression within x2 from the mean (from 33%)
Differential Expression
When more is less, and simplest is the best
Our proposal for a model of differential expression (DE) changes
Statistical formulation and assumptions (similar model for variance)
1. Expression in reference state is not of prime scientific interest (can omit correction for bias)
2. Technical sources of variation (PCR efficiency, library sampling) are of much smaller magnitude than biological variability
Parameter interpretation and context of use
• Accommodates global and sequence specific DE changes
• Flexible modeling of referent (global level and variation around it)
• Still models counts
• No incorporation of library specific factors (model is un-normalized)
Existing Models for RNA-seq experiments
• Number of reads in sample j, assigned to species i (Ki,j)
• Assumed to follow a negative binomial distribution
• Standard deviation: parameterized differently in edgeR (1) and DESeq (2)
• Mean = common scale (coverage of the library, sequence depth) combined with experimental effects
1 Biostatistics 2008, 9:321-32
2 Genome Biology 2010, 11:R106
Model for differential expression analysis
[Equation: log-linear model for the mean, with terms for miRNA expression in the control group and miRNA expression in the experimental group]
Comparison of proposed approach against existing methods
“We” (gamlss)
• Uses the NB or the LQNO
• LQ relation between mean and variance
• Variance and mean parameters are estimated simultaneously
• Explicit count based modeling
• Un-normalized
• Shrinkage via random effects modeling
• Derived from first principles (a generative probability model)
“They” (edgeR/DESeq2 etc.)
• NB or the linear model
• LQ or flexible relation between mean & variance
• Two stage procedure to estimate parameters
• Models counts as % of a given library depth
• Normalized (% sum to one)
• Shrinkage via random effects modeling
• Ad hoc, phenomenological probability model
Scenarios of differential expression to assess method performance
• Clustered, symmetric differential expression
  1. fraction of overexpressed sequences is equal to that of the underexpressed
  2. no change in global expression: over- and underexpressed RNAs are present in equal numbers and exhibit the same degree of DE
• Asymmetric, clustered differential expression
  1. Fraction of overexpressed sequences ≠ underexpressed
  2. Drives global expression change in one direction
• Global change: all RNAs exhibit a variable but consistent directional change of expression
• No change
All scenarios implemented through the validation datasets
The GAMLSS has smaller RMSE than 10 popular workflows for DE analysis
• Performance benefit seen under scenarios of asymmetric, clustered differential expression changes
• When DE changes are (nearly) symmetric, many other methods have similar performance
Existing methods cannot detect global, directional differential expression
Algorithm performance in the absence of differential expression
GAMLSS demonstrates the optimal balance between False Omission and False Discovery Rates
ROC Curve Analysis FDR and FOR
What did we just find out about algorithms for DE analysis?
• Proposed method (GAMLSS) is the top performer:
  • Symmetric, clustered DE changes
  • Asymmetric, clustered DE changes
  • Asymmetric, global DE changes
  • No DE change
  • Optimal balance between FDR and FOR
• Existing methods introduce moderate-to-severe bias
  • They force the overall DE to sum to zero (what goes up must be accompanied by something that goes down)
  • Voom/limma is somewhat more resilient, with near identical performance to GAMLSS under symmetric DE
These patterns have not been seen before, because no one to date has generated datasets with known composition/DE
Why do existing methods fail to deliver?
• Existing models for RNA-seq analysis, e.g. DESeq and edgeR, can be derived from first principles as approximations
  • RNA-seq counts as % of library depth
  • Valid for dilute samples, not dominated by a few RNA species
• Library size depth and modeling counts as % (a relic of the SAGE era) may be a disastrous distraction
• Parameterization constrains DE over all RNAs included in the analysis to sum to zero
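A toy calculation shows why treating counts as fractions of the library can mislead. It is not the edgeR or DESeq algorithm (those use their own normalization schemes); it only contrasts fold changes computed on absolute abundances with fold changes computed on proportions, using made-up numbers for four RNAs.

```python
import math


def log2_fold_changes(reference, experimental):
    return [math.log2(e / r) for r, e in zip(reference, experimental)]


def as_fractions(counts):
    total = sum(counts)
    return [c / total for c in counts]


reference = [100.0, 200.0, 300.0, 400.0]  # hypothetical absolute abundances

# Scenario 1: global change, every RNA doubles.
global_up = [2 * c for c in reference]
abs_global = log2_fold_changes(reference, global_up)
rel_global = log2_fold_changes(as_fractions(reference), as_fractions(global_up))

# Scenario 2: asymmetric change, only the first RNA increases 8-fold.
asym = [8 * reference[0]] + reference[1:]
abs_asym = log2_fold_changes(reference, asym)
rel_asym = log2_fold_changes(as_fractions(reference), as_fractions(asym))

print("global, absolute:", [round(x, 2) for x in abs_global])
print("global, fractions:", [round(x, 2) for x in rel_global])
print("asymmetric, absolute:", [round(x, 2) for x in abs_asym])
print("asymmetric, fractions:", [round(x, 2) for x in rel_asym])
```

On the fraction scale the global doubling disappears entirely, and in the asymmetric scenario the three unchanged RNAs appear under-expressed while the true 8-fold change is understated.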
Practical implications for experimentalists (not using GAMLSS)
• Any change to the population of RNAs modelled (e.g. filtering) → different DE values from the same dataset
• Both type M errors (wrong degree of DE change) and type S errors (labelling an over-expressed sequence as under-expressed and vice versa)
  • Up to 25% of estimated DE changes may be in the wrong direction
  • Up to 100% of estimated DE changes may be of the wrong magnitude
• RNA-seq findings will fail to validate against qPCR
• The reputation of RNA-seq as a semiquantitative technique of poor reproducibility is due to statistical methodology
MicroRNA regulation
A control theory perspective
microRNA biology & therapeutic applications
http://www.nature.com/nature/journal/v469/n7330/fig_tab/nature09783_F1.html
http://www.nature.com/nature/journal/v469/n7330/full/nature09783.html
Control In Biological Systems Is Many-To-Many, Cooperative And Patterned
Feala JD, et al. PLoS ONE 7(1): e29374. (2012)
Riba A et al PLoS Comput Biol 10(2): e1003490. (2014)
Bipartite Control Network Topologies miRNA – Transcription Factor circuits
Feed Forward Loop: master control layout in many natural and artificial control systems
How do we control things?
Predictably simple (open loop)
Error Correcting (feedback)
Model based (feed forward)
Feed forward control
• Control element responds to a change in the environment in a predefined manner
• Based on prediction of plant (“what is being controlled”) behavior (requires a model of the system)
• Can react before an error actually occurs (stabilizing the system, e.g. cerebellar control of balance)
• Benefits: reduced hysteresis, increased accuracy, cost-efficiency, lower “wear and tear”
Practical implications
• miRNAs function as master controllers in FFLs
• biology is intrinsically NOT model free
• miRNA profiling reveals the “plant” dynamics of complex biological processes
• Emerging data suggest that sequence variation may underlie (dys-)regulation
• miRNA associations are by definition causal to some aspects of a particular phenotype
  • “a priori plausible” biomarkers
  • direct therapeutic implications
• Examination of the “plant” (targets) may have implications for microRNA research
  • Context for the interpretation of microRNA changes
  • “Stronger” biomarker signatures
microRNAs are rational candidates for exploring paradigm shifts in biology
• Ubiquity-conservation
• Breadth & width of regulation (>60% of genes)
• Context-specificity (“meta-controller”)
• Master Controllers in Feed Forward Loops
These arguments are not disease-area specific (e.g. they apply equally well to cancer or even psychiatric disease)
MicroRNAs as biomarkers
Renal, Diabetes and Cardiometabolic Disease
Facts, figures and the natural history of cardiometabolic and renal disease in diabetes
• 8-10% of the population suffer from diabetes
• 20-30% of patients with diabetes will develop evidence of diabetic chronic kidney disease (DKD/CKD)
• DKD progresses in stages of increasing proteinuria
• 50% of patients with overt nephropathy will develop End Stage Renal Disease (ESRD) within 10 years
• The end result: diabetic nephropathy is the leading cause of ESRD requiring dialysis or kidney transplantation, accounting for 40% of cases
• DKD is costly:
  • 40-50% of the $44B Medicare expenditures for CKD
  • 40-50% of the $50B total healthcare costs for ESRD
• DKD is lethal (>50% of these deaths are cardiac)
• Current therapies reduce risk by 30%
• Many of the things we tried in order to stabilize renal function AND improve cardiovascular disease failed miserably in trials
• A paradigm change in our understanding of DKD is warranted => We posit that miRNAs will trigger this shift
• This improvement will likely spread to other areas given the biology of cardiovascular disease (“extreme phenotype”)
There is a significant unmet need for therapies that stabilize progression and reduce death rates in patients with diabetic kidney disease
1 Afkarian et al J Am Soc Nephrol. 2013 Feb;24(2):302-8
US population (1)
 | No Diabetes | Diabetes
No CKD | 7.7% | 11.5%
CKD | 17.2% | 31.1%
[Figure: Dialysis Mortality; % surviving vs. time (months); curves for GN and DM]
Why bother with microRNAs in DKD?
Heart & Vessels
• Angiogenesis
• Vascular inflammation
• Atherosclerosis
• LVH
• Vascular tone
• Endothelial dysfunction
Kidney
• Water homeostasis
• Osmoregulation
• Calcium sensing
• Sodium, potassium, acid base handling
• Renin production
• Renal development
• Renal senescence
• EMT
• Collagen production
• Insulin synthesis and secretion
• Peripheral tissue sensitivity
• Hepatic glucose production
• Inflammatory gene expression
microRNAs as Minimally Invasive Biomarkers : a metrological argument
Advantages of microRNAs
Circulating microRNAs
• More stable in circulation than mRNAs
• High expression level and low complexity compared to mRNA
• Tissue specific expression
• Availability of analytical platforms
  • Keep getting cheaper over time
• Sequence conservation
  • Allows translation of clinical associations to animal models
  • Allows translation of animal models to clinical applications
Cortez et al Nat Rev Clin Oncol. Jun 7, 2011; 8(8): 467–477.
Targets of differentially expressed miRNAs in early and late stages of DN map to overlapping pathways
Pathway: P-value, Fraction (MA vs. NA); P-value, Fraction (Overt vs Normal)
Signal Transduction
  Signaling by SCF-KIT: 0.006, 18/76; 0.001, 41/76
  Signaling by Insulin receptor: 0.009, 23/109; <0.001, 65/109
  Signaling by NGF: 0.016, 38/212; <0.001, 119/212
  Signaling by Rho GTPases: 0.024, 24/125; <0.001, 71/125
  Signaling by ERBB4: 0.027, 16/76; <0.001, 45/76
  Signaling by ERBB2: 0.035, 19/97; <0.001, 59/97
  Signaling by PDGF: 0.040, 22/118; <0.001, 67/118
  Signaling by VEGF: 0.041, 4/11
  Signaling by EGFR: 0.044, 20/106; <0.001, 64/106
  Downstream signaling of activated FGFR: 0.038, 19/98; <0.001, 61/98
  Signaling by BMP: 0.001, 16/23
  Signaling by TGFβ: 0.004, 11/15
  DAG and IP3 signaling: 0.010, 20/31
  PIP3 activates AKT signaling: 0.020, 15/26
  RAF/MAP kinase cascade: 0.031, 7/10
  Signaling by Notch: 0.036, 13/23
  Interaction of integrin α5β3 with fibrillin: 0.044, 2/3
  Interaction of integrin α5β3 with von Willebrand factor: 0.044, 2/3
  Integrin cell surface interactions: 0.024, 40/85
Cell-Cell Communication: 0.009, 57/122
Cell Cycle
  G0 and early G1: 0.040, 12/21
Leveraging the RNA-seq analytical methodology
To boldly go where no one has gone before (but many have tried)
Goals of a microRNA research program in cardiometabolic, renal and diabetes diseases
• Use carefully designed case-control, before-after, randomized controlled trials, and n-of-1 trials for the following goals:
1. Personalized medicine applications (diagnosis/prognosis/precision medicine)
2. Biomarker discovery (e.g. to aid trials)
3. Novel Therapeutics
Animal Models
Clinical Associations
Clinical Interventions
A microRNA driven discovery process
Biomarker Discovery
Mechanistic Insights
Therapeutics
Clinical Science, Bioinformatics, Systems Biology Driven “Reverse Translation”
Translational Science
Evidence Based Medicine
Basic Science
Ingredients for success of a microRNA regulation discovery program
Requires open-ended platforms (RNA-seq)
o Especially for kidney disease due to intrarenal RNA editing
Requires unbiased quantification between groups of patients (differential expression analysis)
Requires unbiased and accurate quantification in the absence of a controlled comparison (diagnostics – bias correction)
Proposed approach: GAMLSS for RNA-seq satisfies requirements better than all currently used methods
Measurement in clinical diagnostics
What we want to happen
• Measurement is reproducible
• Measurement shows minimal inter-individual variation
• Measurement shows minimal intra-individual variation
[Figure: January and June readings are identical. Condition A: Patient 1 (10,10), Patient 2 (10,10). Condition B: Patient 3 (15,15), Patient 4 (15,15)]
What actually happens
• Measurement is non-reproducible
• Measurement shows high inter-individual variation
• Measurement shows high intra-individual variation
[Figure: January: Patient 1 (10,18), Patient 2 (13,10), Patient 3 (15,10), Patient 4 (15,18). June: Patient 1 (10,12), Patient 2 (15,14), Patient 3 (18,11), Patient 4 (14,19). Assignment to Condition A or Condition B becomes uncertain (“? Condition”)]
Lessons from clinical chemistry labs
• Understand and control for the sources of variation
• Use calibration sets as references
• A measurement is instrument specific
• Global reference standards (role for highly competent labs that maintain the standards)
• Context of use:
  • Detector (“out-of-limits” readings)
  • Control (“track the course”)
• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data as it correctly represents all sources of variation and can use calibration (equimolar) runs
• Combine this with a protocol that experimentally controls variation (e.g. 4N protocol of the Galas Lab)
Measurement in experimental samples
What we want to happen
• Certain of the difference
• Measurement is reproducible
• Measurement shows no variation
[Figure: Run 1: Condition A (10, 10, … , 10), Condition B (15, 15, … , 15), B > A. Run 2: Condition A (10, 10, … , 10), Condition B (15, 15, … , 15), B > A]
What actually happens
• Uncertain of the difference
• Measurement is non-reproducible
• Measurement shows high variation
[Figure: Run 1: Condition A (11, 7, … , 10), Condition B (8, 19, … , 26), B > A. Run 2: Condition A (120, 90, … , 130), Condition B (150, 60, … , 20), B < A]
• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data as it optimizes discovery/omission rates & exhibits the least bias
• BUT what do these correctly/unbiasedly assessed DE changes mean?
Understanding the context for differential expression changes
• A list of de-regulated targets will not by itself support the microRNA discovery process
• Need some context to interpret changes and guide further research
• This context is provided by analysis of microRNA targets
• We have proposed and applied a formal target analysis methodology in our early diabetic nephropathy investigations
Formal Target Analysis: A Biochemical Primer
1. Hill plot: \( \operatorname{logit}(\theta) = \log L - \log K_d \)
2. Fold change between two states: \( \log_2 FC = \log_2 (L_E / L_R) \)
3. Change in binding between the two states: \( \operatorname{logit}(\theta_E) - \operatorname{logit}(\theta_R) = \log(OR) = \log 2 \cdot \log_2 FC \)
4. Means and standard errors for the fold changes can be synthesized using random effects meta-analysis
5. Integration of fold changes from different experiments
• Use GAMLSS as the prime analytical tool to analyze differential expression in short RNA-seq data as it achieves the smallest error among algorithms
http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/Farmac/Comput_Lab/Guia_Glaxo/chap3b.html
The 1st grade approach to target analysis
Heuristic Argument: count the number of miRNAs with small p values
• Total Score (TS)= # of differentially expressed miRNAs predicted to bind to a given target
• Regulation Score (RS)= # over-expressed- # under-expressed miRNAs predicted to bind to a given target
TS low (any RS): Low Signal To Noise Ratio
TS high, RS −: Target likely disinhibited
TS high, RS 0: Target likely neutrally modulated
TS high, RS +: Target likely inhibited
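The two heuristic scores are simple tallies and can be sketched directly. The miRNA names, directions and target predictions below are hypothetical placeholders; in practice the directions would come from the DE analysis and the predictions from a target-prediction resource.

```python
from collections import defaultdict


def target_scores(mirna_direction, predicted_targets):
    """Total Score (TS) and Regulation Score (RS) per target.

    mirna_direction: +1 for over-expressed, -1 for under-expressed miRNAs.
    predicted_targets: miRNA -> list of targets it is predicted to bind.
    """
    scores = defaultdict(lambda: {"TS": 0, "RS": 0})
    for mirna, direction in mirna_direction.items():
        for target in predicted_targets.get(mirna, ()):
            scores[target]["TS"] += 1          # one more DE miRNA hits this target
            scores[target]["RS"] += direction  # over-expressed minus under-expressed
    return dict(scores)


# Hypothetical differentially expressed miRNAs and predicted targets.
mirna_direction = {"miR-A": +1, "miR-B": +1, "miR-C": -1, "miR-D": -1}
predicted_targets = {
    "miR-A": ["GENE1", "GENE2"],
    "miR-B": ["GENE1"],
    "miR-C": ["GENE2", "GENE3"],
    "miR-D": ["GENE3"],
}

scores = target_scores(mirna_direction, predicted_targets)
for gene in sorted(scores):
    print(gene, scores[gene])
```

Read against the grid above, GENE1 (positive RS) would be a candidate for inhibition, GENE3 (negative RS) for disinhibition, and GENE2 (RS of zero) for neutral modulation.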
• Use GAMLSS as the prime analytical tool to analyze putative targets of differentially expressed microRNAs as it achieves the optimal balance between FDR/FOR
Target Analysis for PDGF-Beta in patients with overt diabetic kidney disease (DKD)
Target Gene: PDGFB
[Forest plot; x-axis: Odds Ratio / Expression Ratio (0.2 to 150, log scale)]
Study | TE | seTE | OR | 95%-CI | W(fixed) | W(random)
hsa-let-7a-5p | 0.80 | 0.5893 | 2.23 | [0.70; 7.09] | 2.8% | 2.8%
hsa-let-7b-5p | -0.46 | 0.5709 | 0.63 | [0.21; 1.93] | 3.0% | 3.0%
hsa-let-7c | -0.30 | 0.5681 | 0.74 | [0.24; 2.26] | 3.1% | 3.1%
hsa-let-7d-5p | 0.61 | 0.6348 | 1.85 | [0.53; 6.42] | 2.5% | 2.5%
hsa-let-7e-5p | 0.22 | 0.5636 | 1.24 | [0.41; 3.75] | 3.1% | 3.1%
hsa-let-7f-5p | 0.32 | 0.5604 | 1.38 | [0.46; 4.14] | 3.2% | 3.2%
hsa-let-7g-5p | 0.71 | 0.6051 | 2.04 | [0.62; 6.66] | 2.7% | 2.7%
hsa-let-7i-5p | 0.45 | 0.6479 | 1.56 | [0.44; 5.56] | 2.4% | 2.4%
hsa-miR-106a-5p | 0.37 | 0.5721 | 1.45 | [0.47; 4.46] | 3.0% | 3.0%
hsa-miR-106b-5p | 0.37 | 0.6578 | 1.45 | [0.40; 5.26] | 2.3% | 2.3%
hsa-miR-122-5p | -0.06 | 0.5752 | 0.94 | [0.31; 2.91] | 3.0% | 3.0%
hsa-miR-1224-3p | 1.52 | 0.7148 | 4.56 | [1.12; 18.50] | 1.9% | 1.9%
hsa-miR-134 | 0.44 | 0.5414 | 1.56 | [0.54; 4.50] | 3.4% | 3.4%
hsa-miR-140-3p | 0.08 | 0.6300 | 1.09 | [0.32; 3.73] | 2.5% | 2.5%
hsa-miR-17-5p | 0.51 | 0.6882 | 1.67 | [0.43; 6.42] | 2.1% | 2.1%
hsa-miR-1909-3p | 0.32 | 0.5286 | 1.38 | [0.49; 3.88] | 3.5% | 3.5%
hsa-miR-1913 | 0.83 | 0.5430 | 2.28 | [0.79; 6.62] | 3.4% | 3.4%
hsa-miR-204-5p | 0.43 | 0.5450 | 1.54 | [0.53; 4.48] | 3.3% | 3.3%
hsa-miR-20a-5p | -0.12 | 0.5736 | 0.89 | [0.29; 2.73] | 3.0% | 3.0%
hsa-miR-20b-5p | 0.33 | 0.7984 | 1.39 | [0.29; 6.63] | 1.6% | 1.6%
hsa-miR-2110 | 0.09 | 0.5451 | 1.09 | [0.38; 3.18] | 3.3% | 3.3%
hsa-miR-2113 | 0.55 | 0.7309 | 1.74 | [0.41; 7.28] | 1.9% | 1.9%
hsa-miR-324-3p | 0.07 | 0.5503 | 1.07 | [0.36; 3.14] | 3.3% | 3.3%
hsa-miR-329 | 0.14 | 0.5424 | 1.15 | [0.40; 3.34] | 3.4% | 3.4%
hsa-miR-335-5p | 1.78 | 0.6324 | 5.90 | [1.71; 20.39] | 2.5% | 2.5%
hsa-miR-342-3p | -0.10 | 0.5810 | 0.90 | [0.29; 2.81] | 2.9% | 2.9%
hsa-miR-361-3p | 0.05 | 0.5991 | 1.05 | [0.32; 3.39] | 2.8% | 2.8%
hsa-miR-450b-3p | 0.74 | 0.6166 | 2.10 | [0.63; 7.03] | 2.6% | 2.6%
hsa-miR-491-5p | 0.60 | 0.6992 | 1.81 | [0.46; 7.14] | 2.0% | 2.0%
hsa-miR-501-5p | -0.08 | 0.7341 | 0.93 | [0.22; 3.90] | 1.8% | 1.8%
hsa-miR-545-3p | -0.01 | 0.7830 | 0.99 | [0.21; 4.60] | 1.6% | 1.6%
hsa-miR-558 | 0.27 | 0.5398 | 1.31 | [0.45; 3.76] | 3.4% | 3.4%
hsa-miR-603 | -0.64 | 0.5310 | 0.53 | [0.19; 1.50] | 3.5% | 3.5%
hsa-miR-608 | 0.11 | 0.7424 | 1.12 | [0.26; 4.79] | 1.8% | 1.8%
hsa-miR-663b | -0.41 | 0.8823 | 0.66 | [0.12; 3.74] | 1.3% | 1.3%
hsa-miR-765 | 0.72 | 0.5878 | 2.06 | [0.65; 6.50] | 2.9% | 2.9%
hsa-miR-93-5p | -0.12 | 0.5416 | 0.89 | [0.31; 2.57] | 3.4% | 3.4%
Fixed effect model | | | 1.33 | [1.09; 1.61] | 100% | --
Random effects model | | | 1.33 | [1.09; 1.61] | -- | 100%
I-squared=0%, tau-squared=0, p=0.9656
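The pooled estimate in the PDGFB forest plot can be approximated with a few lines of inverse-variance meta-analysis. The sketch below uses the TE and seTE values as printed on the slide (rounded, so small discrepancies are expected) and the DerSimonian-Laird estimator for the between-study variance; whether the original analysis used exactly this estimator is an assumption.

```python
import math

# Log odds ratios (TE) and standard errors (seTE) transcribed from the slide.
te = [0.80, -0.46, -0.30, 0.61, 0.22, 0.32, 0.71, 0.45, 0.37, 0.37,
      -0.06, 1.52, 0.44, 0.08, 0.51, 0.32, 0.83, 0.43, -0.12, 0.33,
      0.09, 0.55, 0.07, 0.14, 1.78, -0.10, 0.05, 0.74, 0.60, -0.08,
      -0.01, 0.27, -0.64, 0.11, -0.41, 0.72, -0.12]
se = [0.5893, 0.5709, 0.5681, 0.6348, 0.5636, 0.5604, 0.6051, 0.6479,
      0.5721, 0.6578, 0.5752, 0.7148, 0.5414, 0.6300, 0.6882, 0.5286,
      0.5430, 0.5450, 0.5736, 0.7984, 0.5451, 0.7309, 0.5503, 0.5424,
      0.6324, 0.5810, 0.5991, 0.6166, 0.6992, 0.7341, 0.7830, 0.5398,
      0.5310, 0.7424, 0.8823, 0.5878, 0.5416]


def dersimonian_laird(effects, std_errors):
    """Random effects pooling with the DerSimonian-Laird tau-squared estimate."""
    w = [1.0 / s ** 2 for s in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum_w
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    w_re = [1.0 / (s ** 2 + tau2) for s in std_errors]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return pooled, se_pooled, tau2, q


pooled, se_pooled, tau2, q = dersimonian_laird(te, se)
odds_ratio = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled))
print(f"OR {odds_ratio:.2f}, 95% CI [{ci[0]:.2f}; {ci[1]:.2f}], tau^2 {tau2:.3f}, Q {q:.1f}")
```

With Q below its degrees of freedom the between-study variance is truncated to zero, so the fixed and random effects estimates coincide, which matches the identical summary rows on the slide.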
Target analysis of the NFE2L2/Nrf2 pathway in DKD
Target analysis of the TGF-beta pathway in DKD
To boldly go where no one has gone before….
Methodological
• Extend the model to account for abundance dependent variations in PCR efficiency
• Incorporate target analysis into count analysis
• Estimate ligase bias from the sequence (computationally derived correction factors)
microRNA biomarkers projects
• COMPASS: a community disease detection program focusing on diabetes and CKD in rural New Mexico
• MIRROR-Transplant: metabolic and immunological factors contributing to kidney transplant failure
• DIDIT: randomized controlled trial to preserve urine production in patients starting dialysis
• Potential areas for collaboration in the NIH biorepository?
Summary
• A generative probability model for the counts of short RNA-seq measurements was developed
• This model may be used to estimate and substantially correct for the presence of ligase bias
• It achieves better performance (smaller error, optimal balance of false discoveries and omissions) than competing methodologies
• It can be used to power “personalized” medicine applications or experimental state comparisons
• Formal target analysis can guide further research (“reverse-translation”)
Acknowledgements
• This work could not have been completed without the collaboration of the Galas Lab at PNRI
  • David Galas: provided a friendly ear that had the patience to listen, comment and risk time and funds for the experiments
  • Alton Etheridge: pushed for extensive sequencing and resequencing and carried out all the validation experiments
  • Nikita Sakhanenko: had the patience to be our software tester, validator and GEO submitter
• This work would not have started without John P (Nick) Johnson (University of Pittsburgh), who kicked me into the area about 8 years ago
https://bitbucket.org/chrisarg/rnaseqgamlss
?
Backup
Building the model from first principles
• Establish statistical distributions OR deterministic relationships that “bind” together the quantities in successive steps
• There is a “competitive qPCR” experiment beating inside each RNA-seq dataset → random
• Ligase bias is reproducible → deterministic/systematic
• Apply marginalization (integration) operations to “flatten” the hierarchy
• Derive the exact distributions (or the limits of approximation) for a statistical model that directly represents the quantity of interest
• Relate model parameters to quantities of interest (absolute/relative quantification)
Facts about the distribution of RNA-seq data
• Established relationships between distributions that were first explored in the 1920s-1930s
• Rare biomedical applications in the 1940s
• Theoretical work in the early 1960s
• Lead goes cold due to failure to conceptualize practical applications after the 1960s
• Extremely involved expressions involving special functions of mathematical physics (parabolic cylinder functions) → numerical complexities will hinder attempts to use them as-is in applications
Rediscovering a Negative Binomial parameterization and introducing a new Gaussian Generalized Linear Normal Family
• Large scale numerical simulations (>500,000) to establish approximations for the RNA-seq distribution
• Arbitrary precision libraries in python in multicore machines
• Low precision – but acceptable for statistical computations
• Both approximations implement a LQ relationship between the mean and variance
• Inferences are largely the same (shown in synthetic mixes)
Two equivalent views of measures of differential expression: Fold Change and Probability of Over-Expression
• The GLM approaches (limma, DESeq/DESeq2, gamlss) yield measures of differential expression for microarrays, RNA-Seq or qPCR experiments
• These are estimates of fold changes (signal) and their associated standard errors (noise)
• They can be converted to probability estimates (pDE) about the signal being >0 (overexpressed) vs. <0 (underexpressed)
• The standard error of pDE is given by [formula not recoverable from the slide]
[Figure: density of the estimated fold change; x-axis: Fold Change]
Fold Change = 1.0, SE = 1.0; the shaded area (= 1.0 - pnorm(0, FC, SE) in R) yields the probability of overexpression
Computing probability of differential expression (pDE) in R
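For readers working outside R, the same shaded-area calculation can be written with the Python standard library. This is a direct translation of `1.0 - pnorm(0, FC, SE)`, assuming the estimated fold change is on a log scale and approximately Normal; the function name is illustrative.

```python
from statistics import NormalDist


def prob_overexpression(fold_change, std_error):
    """P(true log fold change > 0) under a Normal(fold_change, std_error) estimate."""
    return 1.0 - NormalDist(mu=fold_change, sigma=std_error).cdf(0.0)


p_de = prob_overexpression(1.0, 1.0)   # the example shown on the slide
p_null = prob_overexpression(0.0, 1.0)  # no evidence either way
print(round(p_de, 4), round(p_null, 4))
```

One minus this probability is the chance that the sign of the estimate is wrong, which connects the pDE view to type S errors.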
Why do we need two views of the same data?
The FC View
• Absolute, relative quantification is possible
• Fold changes in one miRNA are directly comparable against each other
• Fold changes are comparable between and within techniques
• Type I and II statistical errors
The pDE View
• Only relative, relative quantification is possible
• Platforms provide evidence for directional changes in expression
• Type M and S errors
• Provides input to Systems Biology tools (e.g. Boolean Networks)
Solid measurements for thinning one’s blood: the history of the PT test
• Experimental work in the late 19th century to discover the physiological basis of coagulation (“prothrombin”)
• Development of different versions of the “Prothrombin Time”: investigations in hemophilia, post-op bleeding & liver disease (1930s-1950s) derived the normal range and ranges associated with specific deficits
• Pre-analytical considerations throughout the 1950s (and even today)
• In the 1970s PT was used to monitor and dose warfarin in the clinic
• Classical studies in the 1970s-1980s demonstrate high inter-, intra- and analytic variability (despite > 30 years of standardization)
• WHO proposed to standardize the test in the mid 1980s through the use of the INR (international normalized ratio)
http://www.clinchem.org/content/51/3/553.full
http://circ.ahajournals.org/content/19/1/92.full.pdf
Thromb Haemost. 1985 Feb 18;53(1):155-6.
The cautious story of the INR
Normalization procedure
• INR = (PTpatient / PTnormal)^ISI (ISI: International Sensitivity Index)
• PTnormal: geometric mean of 20 patients
Sources of variation
• Different methods to measure the PT
• Different instruments that implement each method
• Different calibrator sets for each instrument!
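The normalization can be illustrated with a short calculation, assuming the standard WHO definition INR = (PT_patient / PT_normal)^ISI, with PT_normal taken as the geometric mean of a reference panel. The prothrombin times and ISI values below are invented for illustration.

```python
import math


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def inr(pt_patient, pt_reference_panel, isi):
    """International normalized ratio for one patient and one reagent/instrument."""
    pt_normal = geometric_mean(pt_reference_panel)
    return (pt_patient / pt_normal) ** isi


# Hypothetical reference panel of 20 prothrombin times (seconds).
panel = [12.0] * 20

inr_isi_1 = inr(24.0, panel, isi=1.0)    # PT ratio of 2 with ISI = 1
inr_isi_12 = inr(24.0, panel, isi=1.2)   # same ratio, less sensitive reagent
print(round(inr_isi_1, 3), round(inr_isi_12, 3))
```

The same measured PT ratio maps to different INR values depending on the ISI, which is why the calibrator set attached to each instrument matters.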
http://www.who.int/bloodproducts/publications/WHO_TRS_889_A3.pdf
http://www.clinchem.org/content/56/10/1618.full
http://www.clinchem.org/content/51/3/553.full
Statistics Of Biological Regulatory Networks
Feala JD, et al. PLoS ONE 7(1): e29374. (2012)
Pathophysiology of the cardiorenal syndrome
http://www.kdigo.org/meetings_events/pdf/KDIGO%20CVD%20Controversy%20Rpt.pdf