Correcting bias and variation in small RNA sequencing for optimal (microRNA) biomarker discovery and validation in cardio-metabolic (and renal) disease


Christos Argyropoulos MD, PhD, FASN

Department of Internal Medicine, Division of Nephrology

University of New Mexico Health Sciences Center

Overview
• Models of sequence counts in short RNA-seq experiments
• Estimating and controlling for bias in small RNA-seq experiments
• Statistical approaches to analyzing differential expression
• MicroRNA regulation – a control theory perspective
• MicroRNAs as biomarkers in diabetes, renal and cardiometabolic disease
• Leveraging our approach for optimal biomarker discovery

Signals in short RNA-seq data: Building a model from first principles

Background
• Short RNA-seq data are becoming more and more abundant
• There is poor reproducibility of findings between and within research groups
• Systematic measurement biases confound findings
  • Systematic variation is relatively stable within protocols
  • Systematic variation is unpredictable between different protocols and platforms
• Statistical methods may be used to explore and address such biases
• Existing approaches are phenomenological descriptions
  • What do model parameters stand for?
  • How can one best use these models?

Building a model from first principles

• Establish testable predictions that may be verified in existing datasets

• Establish correspondence between model parameters and experimental steps

• Use this model to understand and correct systematic and random bias in short RNA-seq

• Embed the model into more general frameworks for applications:
  • Epidemiological
  • Biomarker discovery and validation
  • Medical diagnostics

The short RNA-seq experiment

The vendor’s view The biochemist’s view


https://doi.org/10.1093/nar/gkt1021

http://www.genomics.hk/SamllRna.htm

http://www.geospiza.com/Products/SmallRNA.shtml

Conceptual model of the short RNA-seq experiment (this is what we will talk about)

Per-sequence quantities tracked through the workflow: X1, X2, … , Xn; Λ1, Λ2, … , Λn; B1, B2, … , Bn; Y1, Y2, … , Yn

Stages, in order:
• Abundance in original preparation
• Abundance in adapted (ligated) sample
• Abundance in PCR amplified library
• Abundance in capture probes
• Abundance of counts in fastq files

Parameters linking the stages:
• Ligation efficiency (fi)
• Number of PCR cycles (N) and PCR efficiency (qi)
• Library dilution factor (d), number of probes (K) and probability of capture (si)
• Probability of signal generation (r) and probability of sequence generation (pi)

Modeling the qPCR amplification reaction

• Statistics of PCR amplification
  • Branching (Galton-Watson) process
  • GW distribution only available implicitly, i.e. through simulation
• Large scale simulations to derive an approximation to the GW process
• PCR literature, GW theory and martingale arguments → candidate distributions

• Information theory arguments used to compute distance between GW samples and the approximate distributions

• A (truncated) Normal distribution derived at the end
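The branching-process view of PCR can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the simulation code behind the talk: each molecule is copied with a fixed probability per cycle, and the names `simulate_pcr` and `expected_yield` as well as the values of `x0`, `cycles` and `q` are hypothetical.

```python
import random


def simulate_pcr(x0, cycles, efficiency, rng):
    """Galton-Watson branching: every molecule is copied with probability
    `efficiency` in each cycle (efficiency assumed constant over cycles)."""
    molecules = x0
    for _ in range(cycles):
        copies = sum(1 for _ in range(molecules) if rng.random() < efficiency)
        molecules += copies
    return molecules


def expected_yield(x0, cycles, efficiency):
    """Mean of the branching process after the given number of cycles."""
    return x0 * (1.0 + efficiency) ** cycles


# Hypothetical settings: 5 starting molecules, 10 cycles, efficiency 0.8.
rng = random.Random(2024)
x0, cycles, q = 5, 10, 0.8
runs = [simulate_pcr(x0, cycles, q, rng) for _ in range(200)]
mean_yield = sum(runs) / len(runs)
print("simulated mean:", mean_yield)
print("expected mean: ", expected_yield(x0, cycles, q))
```

The spread of `runs` around the expected mean is the random PCR variation that the approximating distribution has to capture; with efficiency 1.0 the process becomes deterministic doubling.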


Flattening the hierarchy through marginalization

Integrate sources of variation out of the model:
1. Library sequence depth variation
2. PCR amplification

The final statistical model is about absolute counts:
• Direct modeling of counts, not of % of counts
• The limit of approximation encompasses all possible sample compositions
• The result is a truncated Normal-Poisson mixture distribution (approximated via a Negative Binomial or Linear Quadratic Gaussian family)

The model implements a Linear-Quadratic (LQ) mean-variance relationship
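A linear-quadratic mean-variance relationship can be checked numerically with a Poisson mixture. The sketch below swaps in a Gamma mixing distribution, which gives the Negative Binomial mentioned as one of the approximations, rather than the truncated Normal of the talk; the function names and the values of `mean` and `dispersion` are illustrative assumptions.

```python
import math
import random
import statistics


def lq_variance(mean, dispersion):
    """Linear-quadratic variance: a linear (Poisson) term plus a quadratic term."""
    return mean + dispersion * mean ** 2


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def gamma_poisson_counts(mean, dispersion, n, rng):
    """Counts whose Poisson rate is Gamma distributed (a Negative Binomial)."""
    shape = 1.0 / dispersion      # Gamma mean = shape * scale = mean
    scale = mean * dispersion     # Gamma variance = dispersion * mean**2
    return [poisson_draw(rng.gammavariate(shape, scale), rng) for _ in range(n)]


# Hypothetical values chosen only for illustration.
rng = random.Random(7)
mean, dispersion = 10.0, 0.5
counts = gamma_poisson_counts(mean, dispersion, 20000, rng)
sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)
print("sample mean:", sample_mean)
print("sample variance:", sample_var, "LQ prediction:", lq_variance(mean, dispersion))
```

The sample variance should sit close to the LQ prediction and well above the mean, which is the overdispersion that a Poisson-only model would miss.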


Distributional Regression for RNA-seq data

LQ relationship between the mean and the variance
• The variance and the mean have to be modelled concurrently
• Unless the variance is modelled: inconsistent statistics → small (overoptimistic) p values
• Realm of distributional regression models (GAMLSS – Generalized Additive Models for Location, Scale and Shape)
• One can re-use existing software frameworks to fit such models

Validating model(s) with synthetic mixes of known composition

• Allow one to test the “backbone” of the model without worrying about the adequacy of the modeling of biology

• Sequencing of equimolar mixes:
  • Explore and model systematic bias in the same protocol
• Sequencing of dilution series or non-equimolar mixes:
  • “Dose-response” curve of the bias
  • Examination of “debiasing” approaches for the ability to uncover the truth

• Model may also be used to analyze the performance of differential expression algorithms

Testable predictions: mean and variance linear quadratic relationships in public RNA-seq data

Linear Quadratic Relationship in the legacy datasets of the Galas group

Estimating and Correcting for Ligase Bias: At the corner of Biochemistry and Mathematics

Enzymatic mechanism of RNA ligation
• The kinetics of RNA ligation were investigated thoroughly in the 1970s and early 1980s
• The intermolecular reaction is relevant to RNA-seq
• The mechanism involves three fully reversible steps that obey ping-pong ordered kinetics and are subject to substrate inhibition

Bias in RNA ligation was noted in these early investigations, and the enzyme was never used as a tool in synthetic chemistry, as solid phase methods took off in the 80s

Kinetic analysis of ligase reaction velocity in RNA-seq protocols
• Existing protocols include abundant cofactors (in sharp contrast to the experiments in the 1970s)
  • Drive the reaction to the right
  • Rate limiting single-step reaction instead of a tri-step one
  • Substrate preference (bias in reaction yields) is not eliminated
• Multi-substrate inhibition from all biosample sequences available for ligation

Analytical series approximation for ratios of random variables

• Ligase operates at the 1st order domain of Michaelis-Menten kinetics

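\[
V_i = \frac{V_i^{max} X_i}{K_{Mi}\left(1 + \sum_{j} \frac{X_j}{K_{Mj}}\right)}
\approx \frac{V_i^{max} X_i}{K_{Mi}\left(1 + n\,\frac{E[X]}{E[K_M]}\right)}
= \frac{V_i^{max} X_i}{K_{Mi}\,(\cdots)}
\]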

Testable model predictions about ligase bias in RNA-seq experiments

Implications of the mathematical expression for ligase bias:
• Concentration independence
• Sample composition independence
• Transferable within experiments done with the same protocol
• Protocol dependent (reaction velocity incorporates concentration of cofactors and enzyme)
• Sequence equimolar mixes to derive empirical correction factors for ligase bias
• Apply those to biological samples (“offsets” in distributional regression) to eliminate bias
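The idea of learning correction factors from an equimolar run and applying them as log-scale offsets can be sketched as follows. This is a simplified stand-in for the distributional regression used in the talk: it works on raw counts, assumes no zero counts, and the function names and miRNA labels are hypothetical.

```python
import math


def estimate_bias_factors(equimolar_counts):
    """Log-scale bias factor per RNA: its log count minus the mean log count.
    In an equimolar mix every RNA has the same input, so deviations are bias."""
    logs = {rna: math.log(c) for rna, c in equimolar_counts.items()}
    centre = sum(logs.values()) / len(logs)
    return {rna: value - centre for rna, value in logs.items()}


def correct_counts(counts, bias_factors):
    """Subtract the offset on the log scale, i.e. divide by exp(bias factor)."""
    return {rna: c / math.exp(bias_factors[rna]) for rna, c in counts.items()}


# Hypothetical calibration (equimolar) run and biological sample.
calibration = {"miR-a": 400, "miR-b": 100, "miR-c": 25}
factors = estimate_bias_factors(calibration)
sample = {"miR-a": 800, "miR-b": 100, "miR-c": 50}
corrected = correct_counts(sample, factors)
print(corrected)
```

In this toy case the calibration run says miR-a is over-represented four-fold and miR-c under-represented four-fold, so the corrected sample shows miR-a and miR-c at the same level, both twice miR-b.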

There is substantial variation in raw sequence counts from equimolar mixes

Application of bias factors virtually eliminates ligase bias

Monte Carlo Cross Validation in 3 equimolar datasets: randomly split the dataset into learning and testing subsets, learn the correction factor in the learning subset and apply it to correct the estimates of the testing subset. Repeat N times.
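A minimal sketch of such a Monte Carlo cross validation loop is shown below, assuming that each library is a list of counts for the same RNAs and that bias is measured as the centred log count. The helper names, the split fraction and the noise-free synthetic libraries are illustrative assumptions, not the analysis pipeline of the talk.

```python
import math
import random


def log_bias(libraries):
    """Average over libraries of each RNA's log count, centred on the
    library's mean log count (zero counts are not handled in this sketch)."""
    n_rna = len(libraries[0])
    totals = [0.0] * n_rna
    for counts in libraries:
        logs = [math.log(c) for c in counts]
        centre = sum(logs) / n_rna
        for i, value in enumerate(logs):
            totals[i] += value - centre
    return [t / len(libraries) for t in totals]


def rmse(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def monte_carlo_cv(libraries, n_splits, test_fraction, rng):
    """Return (mean RMSE before, mean RMSE after correction) on held-out libraries."""
    n_test = max(1, int(len(libraries) * test_fraction))
    before, after = [], []
    for _ in range(n_splits):
        shuffled = libraries[:]
        rng.shuffle(shuffled)
        test, learn = shuffled[:n_test], shuffled[n_test:]
        factors = log_bias(learn)                  # learn correction factors
        for counts in test:
            deviations = log_bias([counts])        # bias in the held-out library
            before.append(rmse(deviations))
            after.append(rmse([d - f for d, f in zip(deviations, factors)]))
    return sum(before) / len(before), sum(after) / len(after)


# Hypothetical noise-free equimolar libraries: same per-RNA bias, different depths.
bias = [4.0, 1.0, 0.25, 2.0, 0.5]
libraries = [[depth * b for b in bias] for depth in (100, 200, 400, 800, 1000, 2000)]
rmse_before, rmse_after = monte_carlo_cv(
    libraries, n_splits=20, test_fraction=0.33, rng=random.Random(1)
)
print("RMSE before correction:", rmse_before)
print("RMSE after correction: ", rmse_after)
```

Because the synthetic libraries carry no noise, the corrected RMSE collapses to numerical zero; with real data the residual RMSE after correction measures how well the factors transfer between runs.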

Empirical factors nearly eliminate bias between equimolar datasets with 10x different input (Galas Lab legacy datasets)

Bias factors in public non-equimolar short-RNA seq datasets

Design of Validation Experiments

What has been established?
• Moderate concentration independence
• Ability to nearly eliminate bias over at least two orders of magnitude
• Legacy platforms/experiments

What needs to be proven?
• Concentration independence over >2 orders of magnitude
• Sample composition independence
• Recovery of differential expression measures
• Any value relative to existing approaches?

Validation Experiments

Collaboration between PNRI (Galas Lab) and UNM (DoIM)

The largest single-protocol technical series to date (GSE93399)

Experimental Group | Dilution | N
miRExplore (972 short RNAs) | 1:10 | 10
286 miRNAs | 1:1 | 8
286 miRNAs | 1:10 | 8
286 miRNAs | 1:100 | 8
286 miRNAs | 1:1000 | 8
Ratio Metric Series A (descending) | Mix of 286 subpool A (1:1), 286 subpool B (1:10), 286 subpool C (1:100), 286 subpool D (1:1000) | 8
Ratio Metric Series B (ascending) | Mix of 286 subpool A (1:1000), 286 subpool B (1:100), 286 subpool C (1:10), 286 subpool D (1:1) | 8
Total | 7 groups | 58 sequenced x 2 = 116

Empirical bias correction over 3 orders of magnitude in equimolar datasets

RMSE reduction: 77%-90% (input in calibration run differs by up to x10 from target), 54%-67% otherwise

Empirical factors reduce bias by nearly 60% in non-equimolar series

Bias correction recovers expression profile patterns

Bias Correction in Heterogeneous Samples

• Correction factors remove ~55% of bias between equimolar samples
  • ~70% of RNAs have expression within two-fold of the mean (up from 23%)
• Bias reduction is ~40% in ratiometric series
  • ~63% of RNAs have expression within two-fold of the mean (up from 33%)

Differential Expression: When more is less, and simplest is the best

Our proposal for a model of differential expression (DE) changes

Statistical formulation and assumptions (a similar model is used for the variance):
1. Expression in the reference state is not of prime scientific interest (correction for bias can be omitted)
2. Technical sources of variation (PCR efficiency, library sampling) are of much smaller magnitude than biological variability

Parameter interpretation and context of use:
• Accommodates global and sequence-specific DE changes
• Flexible modeling of the referent (global level and variation around it)
• Still models counts
• No incorporation of library-specific factors (model is un-normalized)

Existing Models for RNA-seq experiments

• Number of reads in sample j assigned to species i (Ki,j)
• Assumed to follow a negative binomial distribution
• Standard deviation: parameterized differently in edgeR [1] and DESeq [2]
• Mean: a common scale (coverage of the library, sequence depth) combined with experimental effects
• Experimental effects: log-scale terms for miRNA expression in the control group and miRNA expression in the experimental group

[1] Biostatistics 2008, 9:321-332 [2] Genome Biology 2010, 11:R106

Model for differential expression analysis

Comparison of proposed approach against existing methods

“We” (gamlss)
• Uses the NB or the LQNO
• LQ relation between mean and variance
• Variance and mean parameters are estimated simultaneously
• Explicit count based modeling
• Un-normalized
• Shrinkage via random effects modeling
• Derived from first principles (a generative probability model)

“They” (edgeR/DESeq2 etc)
• NB or the linear model
• LQ or flexible relation between mean & variance
• Two stage procedure to estimate parameters
• Models counts as % of a given library depth
• Normalized (% sum to one)
• Shrinkage via random effects modeling
• Ad hoc, phenomenological probability model

Scenarios of differential expression to assess method performance

• Clustered, symmetric differential expression
  1. The fraction of overexpressed sequences is equal to that of the underexpressed
  2. No change in global expression: over- and underexpressed RNAs are present in equal numbers and exhibit the same degree of DE
• Asymmetric, clustered differential expression
  1. Fraction of overexpressed sequences ≠ fraction of underexpressed
  2. Drives global expression change in one direction
• Global change: all RNAs exhibit a variable but consistent directional change of expression
• No change

All scenarios implemented through the validation datasets

The GAMLSS has smaller RMSE than 10 popular workflows for DE analysis
• Performance benefit seen under scenarios of asymmetric, clustered differential expression changes
• When DE changes are (nearly) symmetric, many other methods have similar performance

Existing methods cannot detect global, directional differential expression

Algorithm performance in the absence of differential expression

GAMLSS demonstrates the optimal balance between False Omission and False Discovery Rates

ROC Curve Analysis FDR and FOR

What did we just find out about algorithms for DE analysis?

• Proposed method (GAMLSS) is the top performer:
  • Symmetric, clustered DE changes
  • Asymmetric, clustered DE changes
  • Asymmetric, global DE changes
  • No DE change
  • Optimal balance between FDR and FOR
• Existing methods introduce moderate-to-severe bias
  • They force the overall DE to sum to zero (what goes up must be accompanied by something that goes down)
  • Voom/limma is somewhat more resilient, with near identical performance to GAMLSS under symmetric DE
• These patterns have not been seen before, because no one to date has generated datasets with known composition/DE

Why do existing methods fail to deliver?

• Existing models for RNA-seq analysis (e.g. DESeq, edgeR) can be derived from first principles as approximations
  • They model RNA-seq counts as % of library depth
  • Valid for dilute samples that are not dominated by a few RNA species
• Library size/depth and modeling counts as % (a relic of the SAGE era) may be a disastrous distraction
• The parameterization constrains DE over all RNAs included in the analysis to sum to zero
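The sum-to-zero constraint can be illustrated with a toy compositional analysis. The sketch below uses a centred log-ratio transform as a stand-in for normalizing counts to library depth; it is not the actual parameterization of edgeR or DESeq, and the function names and abundances are hypothetical.

```python
import math


def true_log2_fc(reference, experimental):
    """Log2 fold changes computed on absolute abundances."""
    return [math.log2(e / r) for r, e in zip(reference, experimental)]


def clr(counts):
    """Centred log-ratio: log2 count minus the mean log2 count of the library."""
    logs = [math.log2(c) for c in counts]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]


def compositional_log2_fc(reference, experimental):
    """Fold changes after each library is centred; they always sum to zero."""
    return [e - r for r, e in zip(clr(reference), clr(experimental))]


# Hypothetical absolute abundances for four RNAs.
reference = [100.0, 200.0, 400.0, 800.0]
global_up = [2.0 * c for c in reference]        # every RNA doubles
asymmetric = [100.0, 200.0, 400.0, 6400.0]      # only the last RNA changes

for label, treated in (("global", global_up), ("asymmetric", asymmetric)):
    print(label, "true:", true_log2_fc(reference, treated))
    print(label, "compositional:", compositional_log2_fc(reference, treated))
```

Under the global change the compositional estimates are all zero (up to floating point rounding) although every RNA doubled, and under the asymmetric change the three unchanged RNAs appear under-expressed. Both are instances of what goes up being forced to be accompanied by something that goes down.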

Practical implications for experimentalists (not using GAMLSS)

• Any change to the population of RNAs modelled (e.g. filtering) → different DE values from the same dataset
• Both type M (degree of DE changes) and type S (labelling an over-expressed sequence as under-expressed and vice versa) errors
  • Up to 25% of estimated DE changes may be in the wrong direction
  • Up to 100% of estimated DE changes may be of the wrong magnitude
• RNA-seq findings will fail to validate against qPCR
• The reputation of RNA-seq as a semiquantitative technique of poor reproducibility is due to statistical methodology

MicroRNA regulation: A control theory perspective

microRNA biology & therapeutic applications


http://www.nature.com/nature/journal/v469/n7330/fig_tab/nature09783_F1.html

http://www.nature.com/nature/journal/v469/n7330/full/nature09783.html

Control In Biological Systems Is Many-To-Many, Cooperative And Patterned

Feala JD, et al. PLoS ONE 7(1): e29374. (2012)

Riba A et al PLoS Comput Biol 10(2): e1003490. (2014)

Bipartite Control Network Topologies miRNA – Transcription Factor circuits

Feed Forward Loop: master control layout in many natural and artificial control systems

How do we control things?

Predictably simple (open loop)

Error Correcting (feedback)

Model based (feed forward)

Feed forward control
• Control element responds to a change in the environment in a predefined manner
• Based on prediction of plant (“what is being controlled”) behavior (requires a model of the system)
• Can react before error actually occurs (stabilizing the system, e.g. cerebellum control of balance)
• Benefits: reduced hysteresis, increased accuracy, cost-efficiency, lower “wear-tear”

Practical implications
• miRNAs function as master controllers in FFLs
• Biology is intrinsically NOT model free
• miRNA profiling reveals the “plant” dynamics of complex biological processes
• Emerging data suggest that sequence variation may underlie (dys-)regulation
• miRNA associations are by definition causal to some aspects of a particular phenotype
  • “A priori plausible” biomarkers
  • Direct therapeutic implications
• Examination of the “plant” (targets) may have implications for microRNA research
  • Context for the interpretation of microRNA changes
  • “Stronger” biomarker signatures

microRNAs are rational candidates for exploring paradigm shifts in biology

• Ubiquity-conservation
• Breadth & width of regulation (>60% of genes)
• Context-specificity (“meta-controller”)
• Master Controllers in Feed Forward Loops

These arguments are not disease area specific (e.g. they apply equally well to cancer or even psychiatric disease)

MicroRNAs as biomarkers: Renal, Diabetes and Cardiometabolic Disease

• 8-10% of the population suffer from diabetes
• 20-30% of patients with diabetes will develop evidence of diabetic chronic kidney disease (DKD/CKD)
• DKD progresses in stages of increasing proteinuria
• 50% of patients with overt nephropathy will develop End Stage Renal Disease (ESRD) within 10 years
• The end result: diabetic nephropathy is the leading cause of ESRD requiring dialysis or kidney transplantation, accounting for 40% of cases

Facts, figures and the natural history of cardiometabolic and renal disease in diabetes

• DKD is costly:
  • 40-50% of the $44B Medicare expenditures for CKD
  • 40-50% of the $50B total healthcare costs for ESRD
• DKD is lethal (>50% of these deaths are cardiac)
• Current therapies reduce risk by 30%
• Many of the things we tried to stabilize renal function AND improve cardiovascular disease failed miserably in trials
• A paradigm change in our understanding of DKD is warranted => We posit that miRNAs will trigger this shift
• This improvement will likely spread to other areas given the biology of cardiovascular disease (“extreme phenotype”)

There is a significant unmet need for therapies that stabilize progression and reduce death rates in patients with diabetic kidney disease

US population [1]

 | No Diabetes | Diabetes
No CKD | 7.7% | 11.5%
CKD | 17.2% | 31.1%

Figure: Dialysis Mortality, % surviving versus time (months, 0 to 60); legend: GN, DM

[1] Afkarian et al J Am Soc Nephrol. 2013 Feb;24(2):302-8

Why bother with microRNAs in DKD?

Heart & Vessels
• Angiogenesis
• Vascular inflammation
• Atherosclerosis
• LVH
• Vascular tone
• Endothelial dysfunction

Kidney
• Water homeostasis
• Osmoregulation
• Calcium sensing
• Sodium, potassium, acid base handling
• Renin production
• Renal development
• Renal senescence
• EMT
• Collagen production

• Insulin synthesis and secretion

• Peripheral tissue sensitivity

• Hepatic glucose production

• Inflammatory gene expression

microRNAs as Minimally Invasive Biomarkers : a metrological argument

Advantages of microRNAs

Circulating microRNAs
• More stable in circulation than mRNAs
• High expression level and low complexity compared to mRNA
• Tissue specific expression
• Availability of analytical platforms
  • Keep getting cheaper over time
• Sequence conservation
  • Allows translation of clinical associations to animal models
  • Allows translation of animal models to clinical applications

Cortez et al Nat Rev Clin Oncol. Jun 7, 2011; 8(8): 467–477.

Targets of differentially expressed miRNAs in early and late stages of DN map to overlapping pathways

Pathways reported for both comparisons (category: Signal Transduction):

Pathway | MA vs. NA: P-value | MA vs. NA: Fraction | Overt vs. Normal: P-value | Overt vs. Normal: Fraction
Signaling by SCF-KIT | 0.006 | 18/76 | 0.001 | 41/76
Signaling by Insulin receptor | 0.009 | 23/109 | <0.001 | 65/109
Signaling by NGF | 0.016 | 38/212 | <0.001 | 119/212
Signaling by Rho GTPases | 0.024 | 24/125 | <0.001 | 71/125
Signaling by ERBB4 | 0.027 | 16/76 | <0.001 | 45/76
Signaling by ERBB2 | 0.035 | 19/97 | <0.001 | 59/97
Signaling by PDGF | 0.040 | 22/118 | <0.001 | 67/118
Signaling by EGFR | 0.044 | 20/106 | <0.001 | 64/106
Downstream signaling of activated FGFR | 0.038 | 19/98 | <0.001 | 61/98

Pathways reported with a single P-value and fraction:

Pathway | P-value | Fraction
Signaling by VEGF | 0.041 | 4/11
Signaling by BMP | 0.001 | 16/23
Signaling by TGFβ | 0.004 | 11/15
DAG and IP3 signaling | 0.010 | 20/31
PIP3 activates AKT signaling | 0.020 | 15/26
RAF/MAP kinase cascade | 0.031 | 7/10
Signaling by Notch | 0.036 | 13/23
Interaction of integrin α5β3 with fibrillin | 0.044 | 2/3
Interaction of integrin α5β3 with von Willebrand factor | 0.044 | 2/3
Integrin cell surface interactions | 0.024 | 40/85
Cell-Cell Communication | 0.009 | 57/122
G0 and early G1 (category: Cell Cycle) | 0.040 | 12/21

Leveraging the RNA-seq analytical methodology: To boldly go where no one has gone before (but many have tried)

Goals of a microRNA research program in cardiometabolic, renal and diabetes diseases

• Use carefully designed case-control, before-after, randomized controlled trials, and n-of-1 trials for the following goals:

1. Personalized medicine applications (diagnosis/prognosis/precision medicine)
2. Biomarker discovery (e.g. to aid trials)
3. Novel Therapeutics

A microRNA driven discovery process

Diagram labels: Animal Models; Clinical Associations; Clinical Interventions; Biomarker Discovery; Mechanistic Insights; Therapeutics; Clinical Science, Bioinformatics, Systems Biology Driven “Reverse Translation”; Translational Science; Evidence Based Medicine; Basic Science

Ingredients for success of a microRNA regulation discovery program

Requires open-ended platforms (RNA-seq)
  o Especially for kidney disease due to intrarenal RNA editing

Requires unbiased quantification between groups of patients (differential expression analysis)

Requires unbiased and accurate quantification in the absence of a controlled comparison (diagnostics – bias correction)

Proposed approach: GAMLSS for RNA-seq satisfies requirements better than all currently used methods

Measurement in clinical diagnostics: what we want to happen vs. what actually happens

What we want to happen:
• Measurement is reproducible
• Measurement shows minimal inter-individual variation
• Measurement shows minimal intra-individual variation
• Illustration (panels labelled JANUARY and JUNE): Condition A, Patient 1: 10, 10 and Patient 2: 10, 10; Condition B, Patient 3: 15, 15 and Patient 4: 15, 15

What actually happens:
• Measurement is non-reproducible
• Measurement shows high inter-individual variation
• Measurement shows high intra-individual variation
• Illustration (panels labelled JANUARY and JUNE): Patient 1: 10, 18; Patient 2: 13, 10; Patient 3: 15, 10; Patient 4: 15, 18 in one panel and Patient 1: 10, 12; Patient 2: 15, 14; Patient 3: 18, 11; Patient 4: 14, 19 in the other, with several patients labelled “? Condition” instead of Condition A or Condition B

Lessons from clinical chemistry labs
• Understand and control for the sources of variation
• Use calibration sets as references
• A measurement is instrument specific
• Global reference standards (role for highly competent labs that maintain the standards)
• Context of use:
  • Detector (“out-of-limits” readings)
  • Control (“track the course”)

• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data as it correctly represents all sources of variation and can use calibration (equimolar) runs

• Combine this with a protocol that experimentally controls variation (e.g. 4N protocol of the Galas Lab)

Measurement in experimental samples: what we want to happen vs. what actually happens

What we want to happen:
• Run 1: Condition A: 10, 10, … , 10; Condition B: 15, 15, … , 15; B > A
• Run 2: Condition A: 10, 10, … , 10; Condition B: 15, 15, … , 15; B > A
• Certain of the difference
• Measurement is reproducible
• Measurement shows no variation

What actually happens:
• Run 1: Condition A: 11, 7, … , 10; Condition B: 8, 19, … , 26; B > A
• Run 2: Condition A: 120, 90, … , 130; Condition B: 150, 60, … , 20; B < A
• Uncertain of the difference
• Measurement is non-reproducible
• Measurement shows high variation

• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data as it optimizes discovery/omission rates & exhibits the least bias

• BUT what do these correctly/unbiasedly assessed DE changes mean?

Understanding the context for differential expression changes
• A list of de-regulated targets will not by itself support the microRNA discovery process
• Need some context to interpret changes and guide further research
• This context is provided by analysis of microRNA targets
• We have proposed and applied a formal target analysis methodology in our early diabetic nephropathy investigations

Formal Target Analysis: A Biochemical Primer

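1. Hill plot, with θ the fraction of target bound, L the ligand (miRNA) concentration and Kd the dissociation constant:
\[ \operatorname{logit}(\theta) = \log L - \log K_d \]

2. Fold change between two states (R: reference, E: experimental):
\[ \log_2 FC = \log_2 \frac{L_E}{L_R} \]

3. Change in binding between the two states:
\[ \operatorname{logit}(\theta_E) - \operatorname{logit}(\theta_R) = \log(OR_{RE}) = \log 2 \cdot \log_2 FC \]

4. Means and standard errors for the fold changes can be synthesized using random effects meta-analysis

5. Integration of fold changes from different experiments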
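The synthesis step can be sketched with a standard inverse-variance meta-analysis. The code below assumes the relation log(OR) = log(2) · log2(FC) for converting fold changes to log odds ratios and uses the DerSimonian-Laird estimator for the between-miRNA variance, which is one common choice and not necessarily the one used in the original analysis. Function names are hypothetical; the three example rows are the first three TE/seTE pairs of the PDGFB forest plot in this deck.

```python
import math


def log2fc_to_log_or(log2_fc):
    """Convert a log2 fold change to a log odds ratio of binding."""
    return math.log(2.0) * log2_fc


def fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def dersimonian_laird(effects, std_errors):
    """Random effects pooling; returns (estimate, standard error, tau squared)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled_fixed, _ = fixed_effect(effects, std_errors)
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return pooled, math.sqrt(1.0 / sum(re_weights)), tau2


# First three log odds ratios (TE) and standard errors (seTE) from the forest plot.
te = [0.80, -0.46, -0.30]
se_te = [0.5893, 0.5709, 0.5681]
estimate, std_error, tau2 = dersimonian_laird(te, se_te)
print("pooled OR:", math.exp(estimate), "tau^2:", tau2)
```

When the between-miRNA variance is estimated as zero, the random effects result coincides with the fixed effect one, which is the situation reported on the forest plot (tau-squared = 0).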

• Use GAMLSS as the prime analytical tool to analyze differential expression in short RNA-seq data as it achieves the smallest error among algorithms

http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/Farmac/Comput_Lab/Guia_Glaxo/chap3b.html

The 1st grade approach to target analysis

Heuristic Argument: count the number of miRNAs with small p values

• Total Score (TS)= # of differentially expressed miRNAs predicted to bind to a given target

• Regulation Score (RS)= # over-expressed- # under-expressed miRNAs predicted to bind to a given target

RS \ TS | Low TS | High TS
− | Low Signal To Noise Ratio | Target likely disinhibited
0 | Low Signal To Noise Ratio | Target likely neutrally modulated
+ | Low Signal To Noise Ratio | Target likely inhibited
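The two scores defined above are simple tallies, so a short sketch may help. The miRNA and gene names are hypothetical, and the cut-off `min_ts` separating low from high TS is an assumption, since no threshold is stated here.

```python
def target_scores(de_mirnas, predicted_targets):
    """de_mirnas: {mirna: +1 if over-expressed, -1 if under-expressed}.
    predicted_targets: {mirna: set of predicted target genes}.
    Returns {gene: (TS, RS)} with TS the number of differentially expressed
    miRNAs predicted to bind the gene and RS = # over minus # under."""
    scores = {}
    for mirna, direction in de_mirnas.items():
        for gene in predicted_targets.get(mirna, ()):
            ts, rs = scores.get(gene, (0, 0))
            scores[gene] = (ts + 1, rs + direction)
    return scores


def interpret(ts, rs, min_ts=3):
    """Map (TS, RS) to the qualitative calls; min_ts is a hypothetical cut-off."""
    if ts < min_ts:
        return "low signal to noise ratio"
    if rs > 0:
        return "target likely inhibited"
    if rs < 0:
        return "target likely disinhibited"
    return "target likely neutrally modulated"


# Hypothetical differentially expressed miRNAs and predicted targets.
de = {"miR-x": +1, "miR-y": +1, "miR-z": -1, "miR-w": +1}
targets = {
    "miR-x": {"GENE1", "GENE2"},
    "miR-y": {"GENE1"},
    "miR-z": {"GENE1", "GENE2"},
    "miR-w": {"GENE1"},
}
scores = target_scores(de, targets)
for gene in sorted(scores):
    print(gene, scores[gene], interpret(*scores[gene]))
```

Here GENE1 is bound by four differentially expressed miRNAs, three of them over-expressed, so it is called likely inhibited, while GENE2 has too few hits to call.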

• Use GAMLSS as the prime analytical tool to analyze putative targets of differentially expressed microRNAs as it achieves the optimal balance between FDR/FOR

Target Analysis for PDGF-Beta in patients with overt diabetic kidney disease (DKD)

Target Gene: PDGFB

Forest plot (x-axis: Odds Ratio / Expression Ratio, log scale from 0.2 to 150)

Study | TE | seTE | OR | 95%-CI | W(fixed) | W(random)
hsa-let-7a-5p | 0.80 | 0.5893 | 2.23 | [0.70; 7.09] | 2.8% | 2.8%
hsa-let-7b-5p | -0.46 | 0.5709 | 0.63 | [0.21; 1.93] | 3.0% | 3.0%
hsa-let-7c | -0.30 | 0.5681 | 0.74 | [0.24; 2.26] | 3.1% | 3.1%
hsa-let-7d-5p | 0.61 | 0.6348 | 1.85 | [0.53; 6.42] | 2.5% | 2.5%
hsa-let-7e-5p | 0.22 | 0.5636 | 1.24 | [0.41; 3.75] | 3.1% | 3.1%
hsa-let-7f-5p | 0.32 | 0.5604 | 1.38 | [0.46; 4.14] | 3.2% | 3.2%
hsa-let-7g-5p | 0.71 | 0.6051 | 2.04 | [0.62; 6.66] | 2.7% | 2.7%
hsa-let-7i-5p | 0.45 | 0.6479 | 1.56 | [0.44; 5.56] | 2.4% | 2.4%
hsa-miR-106a-5p | 0.37 | 0.5721 | 1.45 | [0.47; 4.46] | 3.0% | 3.0%
hsa-miR-106b-5p | 0.37 | 0.6578 | 1.45 | [0.40; 5.26] | 2.3% | 2.3%
hsa-miR-122-5p | -0.06 | 0.5752 | 0.94 | [0.31; 2.91] | 3.0% | 3.0%
hsa-miR-1224-3p | 1.52 | 0.7148 | 4.56 | [1.12; 18.50] | 1.9% | 1.9%
hsa-miR-134 | 0.44 | 0.5414 | 1.56 | [0.54; 4.50] | 3.4% | 3.4%
hsa-miR-140-3p | 0.08 | 0.6300 | 1.09 | [0.32; 3.73] | 2.5% | 2.5%
hsa-miR-17-5p | 0.51 | 0.6882 | 1.67 | [0.43; 6.42] | 2.1% | 2.1%
hsa-miR-1909-3p | 0.32 | 0.5286 | 1.38 | [0.49; 3.88] | 3.5% | 3.5%
hsa-miR-1913 | 0.83 | 0.5430 | 2.28 | [0.79; 6.62] | 3.4% | 3.4%
hsa-miR-204-5p | 0.43 | 0.5450 | 1.54 | [0.53; 4.48] | 3.3% | 3.3%
hsa-miR-20a-5p | -0.12 | 0.5736 | 0.89 | [0.29; 2.73] | 3.0% | 3.0%
hsa-miR-20b-5p | 0.33 | 0.7984 | 1.39 | [0.29; 6.63] | 1.6% | 1.6%
hsa-miR-2110 | 0.09 | 0.5451 | 1.09 | [0.38; 3.18] | 3.3% | 3.3%
hsa-miR-2113 | 0.55 | 0.7309 | 1.74 | [0.41; 7.28] | 1.9% | 1.9%
hsa-miR-324-3p | 0.07 | 0.5503 | 1.07 | [0.36; 3.14] | 3.3% | 3.3%
hsa-miR-329 | 0.14 | 0.5424 | 1.15 | [0.40; 3.34] | 3.4% | 3.4%
hsa-miR-335-5p | 1.78 | 0.6324 | 5.90 | [1.71; 20.39] | 2.5% | 2.5%
hsa-miR-342-3p | -0.10 | 0.5810 | 0.90 | [0.29; 2.81] | 2.9% | 2.9%
hsa-miR-361-3p | 0.05 | 0.5991 | 1.05 | [0.32; 3.39] | 2.8% | 2.8%
hsa-miR-450b-3p | 0.74 | 0.6166 | 2.10 | [0.63; 7.03] | 2.6% | 2.6%
hsa-miR-491-5p | 0.60 | 0.6992 | 1.81 | [0.46; 7.14] | 2.0% | 2.0%
hsa-miR-501-5p | -0.08 | 0.7341 | 0.93 | [0.22; 3.90] | 1.8% | 1.8%
hsa-miR-545-3p | -0.01 | 0.7830 | 0.99 | [0.21; 4.60] | 1.6% | 1.6%
hsa-miR-558 | 0.27 | 0.5398 | 1.31 | [0.45; 3.76] | 3.4% | 3.4%
hsa-miR-603 | -0.64 | 0.5310 | 0.53 | [0.19; 1.50] | 3.5% | 3.5%
hsa-miR-608 | 0.11 | 0.7424 | 1.12 | [0.26; 4.79] | 1.8% | 1.8%
hsa-miR-663b | -0.41 | 0.8823 | 0.66 | [0.12; 3.74] | 1.3% | 1.3%
hsa-miR-765 | 0.72 | 0.5878 | 2.06 | [0.65; 6.50] | 2.9% | 2.9%
hsa-miR-93-5p | -0.12 | 0.5416 | 0.89 | [0.31; 2.57] | 3.4% | 3.4%
Fixed effect model | | | 1.33 | [1.09; 1.61] | 100% | --
Random effects model | | | 1.33 | [1.09; 1.61] | -- | 100%

Heterogeneity: I-squared=0%, tau-squared=0, p=0.9656

Target analysis of the NFE2L2/Nrf2 pathway in DKD

Target analysis of the TGF-beta pathway in DKD

To boldly go where no one has gone before….

Methodological
• Extend the model to account for abundance dependent variations in PCR efficiency
• Incorporate target analysis into count analysis
• Estimate ligase bias from the sequence (computationally derived correction factors)

microRNA biomarkers projects
• COMPASS: a community disease detection program focusing on diabetes and CKD in rural New Mexico
• MIRROR-Transplant: metabolic and immunological factors contributing to kidney transplant failure
• DIDIT: randomized controlled trial to preserve urine production in patients starting dialysis
• Potential areas for collaboration in the NIH biorepository?

Summary
• A generative probability model for the counts of short RNA-seq measurements was developed
• This model may be used to estimate and substantially correct for the presence of ligase bias
• It achieves better performance (smaller error, optimal balance of false discoveries and omissions) than competing methodologies
• It can be used to power “personalized” medicine applications or experimental state comparisons
• Formal target analysis can guide further research (“reverse-translation”)

Acknowledgements
• This work could not have been completed without the collaboration of the Galas Lab at PNRI
  • David Galas: provided a friendly ear that had the patience to listen, comment and risk time and funds for the experiments
  • Alton Etheridge: pushed for extensive sequencing and resequencing and carried out all the validation experiments
  • Nikita Sakhanenko: had the patience to be our software tester, validator and GEO submitter
• This work would not have started without John P (Nick) Johnson (University of Pittsburgh) who kicked me into the area about 8 years ago

https://bitbucket.org/chrisarg/rnaseqgamlss

?

Backup

Building the model from first principles

• Establish statistical distributions OR deterministic relationships that “bind” together the quantities in successive steps

• There is a “competitive qPCR” experiment beating inside each RNA-seq dataset → random

• Ligase bias is reproducible → deterministic/systematic

• Apply marginalization (integration) operations to “flatten” the hierarchy

• Derive the exact distributions (or the limits of approximation) for a statistical model that directly represents the quantity of interest

• Relate model parameters to quantities of interest (absolute/relative quantification)

Facts about the distribution of RNA-seq data

• Established relationships between distributions that were first explored in the 1920-1930s

• Rare biomedical applications in the 1940s
• Theoretical work in the early 1960s
• Lead goes cold due to failure to conceptualize practical applications after the 1960s
• Extremely involved expressions involving special functions of mathematical physics (parabolic cylinder functions); numerical complexities will hinder attempts to use them as-is in applications

Rediscovering a Negative Binomial parameterization and introducing a new Gaussian Generalized Linear Normal Family

• Large scale numerical simulations (>500,000) to establish approximations for the RNA-seq distribution

• Arbitrary precision libraries in python in multicore machines

• Low precision – but acceptable for statistical computations

• Both approximations implement a LQ relationship between the mean and variance

• Inferences are largely the same (shown in synthetic mixes)

Two equivalent views of measures of differential expression: Fold Change and Probability of Over-Expression

• The GLM approach (limma, DESeq/DESeq2, gamlss) yields measures of differential expression for microarrays, RNA-Seq or qPCR experiments

• These are estimates of fold changes (signal) and their associated standard errors (noise)

• They can be converted to probability estimates about the signal being >0 (overexpressed) vs. <0 (underexpressed)

• The standard error of is given by

Figure: Normal density of the estimated fold change (x-axis: Fold Change, -2 to 4; y-axis: 0.0 to 0.4). Fold Change = 1.0, SE = 1.0; the shaded area (= 1.0 - pnorm(0, FC, SE) in R) yields the probability of overexpression

Computing probability of differential expression (pDE) in R
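For readers working outside R, the same quantity can be computed with the Python standard library. This is a minimal sketch of the R expression 1.0 - pnorm(0, FC, SE) quoted in this deck; the function name is hypothetical and the fold change is assumed to be on a scale where 0 means no change.

```python
from statistics import NormalDist


def probability_of_overexpression(fold_change, std_error):
    """P(signal > 0) when the estimate is Normal(fold_change, std_error);
    equivalent to 1 - pnorm(0, FC, SE) in R."""
    return 1.0 - NormalDist(mu=fold_change, sigma=std_error).cdf(0.0)


# The example from the figure: Fold Change = 1.0, SE = 1.0.
p_de = probability_of_overexpression(1.0, 1.0)
print(round(p_de, 4))  # 0.8413
```

An estimate of exactly zero gives a probability of 0.5, i.e. no evidence for either direction.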

Why do we need two views of the same data?

The FC View
• Absolute and relative quantification are possible
• Fold changes in one miRNA are directly comparable against each other
• Fold changes are comparable between and within techniques
• Type I and II statistical errors

The pDE View
• Only relative quantification is possible
• Platforms provide evidence for directional changes in expression
• Type M and S errors
• Provides input to Systems Biology tools (e.g. Boolean Networks)

• Experimental work in late 19th century to discover the physiological basis of coagulation (“prothrombin”)

• Development of different versions of the “Prothrombin Time”: investigations in hemophilia, post-op bleeding & liver disease (1930s-1950s): derived the normal range and ranges associated with specific deficits

• Pre-analytical considerations throughout the 1950s (and even today)

• In the 70s PT was used to monitor and dose warfarin in the clinic
• Classical studies in the 70s-80s demonstrate high inter, intra and analytic variability (despite > 30 years of standardization)
• WHO proposed to standardize the test in the mid 1980s through the use of the INR (international normalized ratio)

Solid measurements for thinning one’s blood: the history of the PT test

Thromb Haemost. 1985 Feb 18;53(1):155-6.

http://www.clinchem.org/content/51/3/553.full

http://circ.ahajournals.org/content/19/1/92.full.pdf

The cautious story of the INR

Normalization procedure
• INR = (PTpatient / PTnormal)^ISI, where ISI is the International Sensitivity Index of the reagent/instrument combination
• PTnormal: geometric mean of 20 patients

Sources of variation
• Different methods to measure the PT
• Different instruments that implement each method
• Different calibrator sets for each instrument!
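The normalization can be written out as a short calculation, assuming the standard definition INR = (PT of the patient / PTnormal)^ISI with PTnormal taken as a geometric mean. The ISI (International Sensitivity Index) is part of that standard definition rather than something spelled out in this deck, and the prothrombin times below are made up.

```python
from statistics import geometric_mean


def inr(pt_patient, normal_pts, isi):
    """International normalized ratio from a patient's prothrombin time,
    a set of normal prothrombin times and the reagent's ISI."""
    pt_normal = geometric_mean(normal_pts)
    return (pt_patient / pt_normal) ** isi


# Hypothetical values: 20 normal samples at 12 s, patient at 24 s.
normal_samples = [12.0] * 20
print(inr(24.0, normal_samples, isi=1.0))
print(inr(24.0, normal_samples, isi=2.0))
```

The same patient reading maps to different INR values when the ISI changes, which is one way the choice of method, instrument and calibrator set feeds into the reported number.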

http://www.who.int/bloodproducts/publications/WHO_TRS_889_A3.pdf

http://www.clinchem.org/content/56/10/1618.full

http://www.clinchem.org/content/51/3/553.full



Statistics Of Biological Regulatory Networks

Feala JD, et al. PLoS ONE 7(1): e29374. (2012)

Pathophysiology of the cardiorenal syndrome

http://www.kdigo.org/meetings_events/pdf/KDIGO%20CVD%20Controversy%20Rpt.pdf
