
12 November 2019

Fergal Duffy
TB Bioinformatics Working Group

Computational techniques for understanding host transcriptional responses to TB

Talk Overview

• Host transcriptional responses to TB
1. Sources of open data
2. Pre-processing transcriptional profiles
3. Cross-study comparisons
  • Case study: Development of ACS-CoR
4. Cross-species comparisons
  • Case study: Mouse ULD signature

A typical workflow

1. Acquire data
– Yours or public data
– RNAseq or microarray
2. Preprocess
– Normalization
3. Downstream analyses
– Differential expression
– Cross-study comparisons
– Predictive models

Understanding host transcriptional responses

• Most common: systemic responses in blood

– Either whole blood or PBMCs

– Using microarrays or RNAseq or …

– 12,000 to 40,000 “measurements”
  • Can be array probes/gene counts/splice junctions

Why blood?

• Local responses determine granuloma outcome

• But we can still learn from systemic responses

• Blood is
– Accessible
– Supported by many analysis tools
– Suitable for use in clinical diagnostics

Lin et al, 2014

Many blood transcriptional signatures to diagnose TB

Sweeney et al, 2016

• Table lists human whole-blood microarray study cohorts available in GEO as of 2016

• Total of ~2,500 transcriptional profiles including TB, LTB and related diseases and healthy individuals

• These have been the basis of many TB diagnostic signatures

• Meta-analysis led to a 3-gene signature
  • GBP5, DUSP3, KLF2


Open TB transcriptional data on GEO

(1) Build your query: https://www.ncbi.nlm.nih.gov/gds/advanced

(2) Look for result Series
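The query-building step can also be scripted. The sketch below only assembles a Boolean search term and an NCBI E-utilities search URL for the GEO DataSets database; it makes no network request. The helper names, the keyword list, and the choice of the `[Organism]` and `[Entry Type]` field tags are illustrative assumptions, so check the term in the advanced search page before relying on it.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_query(keywords, organism="Homo sapiens", entry_type="gse"):
    """Join keywords with OR, then AND them with organism and entry type.

    Hypothetical helper: the field tags are assumptions to verify in the
    GEO advanced search builder.
    """
    keyword_clause = " OR ".join(keywords)
    return (f'({keyword_clause}) AND "{organism}"[Organism] '
            f"AND {entry_type}[Entry Type]")


def build_esearch_url(term, retmax=20):
    """Build (but do not fetch) a search URL for GEO DataSets (db=gds)."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


term = build_geo_query(["tuberculosis", "TB"])
url = build_esearch_url(term)
print(term)
print(url)
```

Restricting the entry type to series mirrors step (2): the Series records are the ones that group all samples of a study.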

Open TB transcriptional data on SRA/BioProject

https://www.ncbi.nlm.nih.gov/bioproject/

Open TB transcriptional data on ArrayExpress

https://www.ebi.ac.uk/arrayexpress/

Search with Booleans (‘OR’, ‘AND’) and filter to find human array assays.

Microarray basics

• Fluorescently labelled transcripts are hybridized to probes
• Chip is scanned to an image; continuous intensity values represent gene expression levels

Microarray normalization

• Background correction
• Log transform
– Results in approximately normally distributed expression values
• Quantile normalization between arrays
• R limma package
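To make the log transform and quantile normalization steps concrete, here is a minimal pure-Python sketch. It is not limma's implementation: it assumes arrays of equal length with no missing values, breaks ties by position instead of averaging them, and uses made-up intensities.

```python
import math


def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one per array (sample).
    Each value is replaced by the mean, across arrays, of the values
    that share its rank.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean across arrays at each rank position.
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays)
                  for i in range(n)]
    normalized = []
    for a in arrays:
        # order[r] is the index of the r-th smallest value in this array.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy intensities for two arrays and three probes (illustrative only).
raw = [[100.0, 400.0, 200.0], [30.0, 10.0, 20.0]]
logged = [[math.log2(v) for v in a] for a in raw]
normed = quantile_normalize(logged)
print(normed)
```

After normalization both arrays contain the same set of values, while each probe keeps its rank within its own array.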

RNAseq basics

• Count number of reads mapping to areas of interest
– Different statistical approaches to count-based vs continuous measurements (e.g. microarrays)
• Also consider sequencing depth

RNAseq normalization

• Estimate mean and variance of counts and use to correct bias in low counts
– Negative binomial
– Poisson
• Correct for sequencing depth
– Counts per million
• Log transform
• R limma/DESeq packages
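The depth correction and log transform can be sketched in a few lines. This is a simplified illustration, not what limma or DESeq do internally: the pseudocount of 1 added before the log is an assumption, and the counts are invented.

```python
import math


def cpm(counts):
    """Counts per million: scale each count by the sample's total reads."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def log2_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount); the pseudocount avoids log(0)."""
    return [math.log2(v + pseudocount) for v in cpm(counts)]


# Same relative expression, sequenced at 10x different depth.
shallow = [10, 30, 60]
deep = [100, 300, 600]
print(log2_cpm(shallow))
print(log2_cpm(deep))
```

Both samples give the same CPM values, which is the point of correcting for sequencing depth before comparing samples.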

Differential Expression Analysis

• Does my gene of interest differ across two (or more) conditions in one study?
– Same approach for microarrays/RNAseq
– t-test, linear models, ANOVA
– Correct for multiple testing: false discovery rate approach
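One widely used false discovery rate approach is the Benjamini-Hochberg procedure; the slides do not say which method was used, so treating it as the example here is an assumption. The sketch adjusts a list of per-gene p-values (made-up numbers) so they can be compared against an FDR cutoff.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
print(q)
```

Genes whose adjusted value falls below the chosen cutoff (for example 0.05) would be called differentially expressed at that FDR.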

Challenges for signature development

• How do we design a statistically interpretable, robust and universal signature?
– n (samples) << p (genes)

• How do we deal with technical variation/batch effects between studies?

• How do we translate to new sequencing technologies?

Case Study: ACS-CoR

• Cohort: Latent TB+ South African adolescents
– Recruited and prospectively followed for 2 years
– 46 individuals developed active TB (Progressors), matched to 107 Control individuals
– Whole blood RNAseq at enrollment
  • Optionally at 180 days, 360 days, 540 days post enrollment
• Goal: Develop a transcriptional signature of TB risk
– Look for transcriptional responses specific to Progressors prior to disease onset

Zak et al, 2016

ACS-CoR signature discovery workflow

• Derived & cross-validated a TB risk signature (RNAseq) on LTB+ adolescents
– ACS 3:1 training:test split

• Translated to qPCR and validated on a new cohort of TB household contacts

ACS-CoR: Training a Pairwise SVM

• ACS-CoR is based on pairs of splice junctions
1. Pick differentially expressed splice junctions
2. Make junction pairs
3. Fit linear decision boundary (Support vector machine: SVM)
4. Select best pairs
5. All pairs ‘vote’ to classify new samples

Advantages
• Interpretable
• Mathematically straightforward
• Visualizable
• Voting is robust to missing genes
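The voting step (step 5) can be illustrated with a small sketch. It does not fit any SVMs: the linear boundaries below are hypothetical stand-ins for already-fitted pair models, and scoring a sample as the fraction of pairs voting ‘Progressor’ is an assumption about how votes are combined.

```python
def pair_vote(x1, x2, boundary):
    """One junction pair's vote; True means 'Progressor'.

    boundary is (w1, w2, b) for the line w1*x1 + w2*x2 + b = 0.
    """
    w1, w2, b = boundary
    return w1 * x1 + w2 * x2 + b > 0


def signature_score(sample, pair_models):
    """Fraction of usable pairs voting 'Progressor'.

    Pairs with a junction missing from the sample are skipped, which is
    why voting tolerates missing measurements.
    """
    votes = [pair_vote(sample[j1], sample[j2], boundary)
             for (j1, j2), boundary in pair_models.items()
             if j1 in sample and j2 in sample]
    if not votes:
        raise ValueError("no junction pair could be evaluated")
    return sum(votes) / len(votes)


# Hypothetical pre-fitted boundaries for three junction pairs.
pair_models = {
    ("J1", "J2"): (1.0, 1.0, -1.0),
    ("J1", "J3"): (1.0, -1.0, 0.0),
    ("J2", "J3"): (0.5, 1.0, -2.0),
}
sample = {"J1": 1.5, "J2": 0.5, "J3": 0.2}
score = signature_score(sample, pair_models)
print(score)

# The same sample with J3 unmeasured still gets a score from the remaining pair.
partial_score = signature_score({"J1": 1.5, "J2": 0.5}, pair_models)
print(partial_score)
```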

RNAseq enables splice junction counting

Lines represent junction-spanning reads

ACS-CoR: Training a Pairwise SVM

• 5-fold cross-validation:
– Randomly split your data into 5
– Train on 4 parts, predict on the remaining 1
– Unbiased estimate of prediction performance
• Internal cross-validation was also used to select robust gene pairs
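A minimal sketch of the 5-fold split follows. It only produces train/test index sets; the model fitting is left out, and the sample count and random seed are arbitrary choices for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "used for training")
```

Every sample is held out exactly once, so each prediction comes from a model that never saw that sample during training.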

ACS-CoR: Housekeeping normalization

Junction  Gene
chr2:114713283-114714936  ACTR3
chr11:67050699-67051177  ADRBK1
chr11:67051844-67052317  ADRBK1
chr1:22413359-22417920  CDC42
chr1:115261366-115262199  CSDE1
chr1:115262363-115263159  CSDE1
chr1:115260837-115261233  CSDE1
chr2:158272655-158275034  CYTIP
chr5:176859807-176860147  GRK6
chr5:176778292-176778452  LMAN2
chr5:176764786-176765488  LMAN2
chr5:176765606-176778173  LMAN2
chr12:50149538-50152009  TMBIM6
chr12:50152263-50152465  TMBIM6
chr12:50152545-50153003  TMBIM6
chr12:50152058-50152165  TMBIM6
chr12:50153104-50155486  TMBIM6
chr1:154130197-154142875  TPM3
chr1:154142945-154143124  TPM3
chr19:35761500-35761620  USF2

Reference junctions used for normalization of the signature

Expression relative to a panel of constitutively expressed housekeeping junctions is more robust to technical variability and facilitates translation to PCR

Expression relative to mean of reference junctions:

Expression = log(Junction_normCPM) - log(mean(refJunction_normCPM))
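The normalization formula can be written out directly. In this sketch the log base is assumed to be 2 (the slide does not state it), the mean is taken over the reference CPM values before the log, as the formula reads, and all numbers are invented.

```python
import math


def relative_expression(junction_cpm, reference_cpms):
    """log2 expression of a junction relative to the mean reference CPM."""
    mean_ref = sum(reference_cpms) / len(reference_cpms)
    return math.log2(junction_cpm) - math.log2(mean_ref)


# Toy values: a signature junction and two reference junctions.
value = relative_expression(8.0, [2.0, 6.0])
print(value)

# Scaling every measurement by the same technical factor leaves it unchanged.
scaled = relative_expression(80.0, [20.0, 60.0])
print(scaled)
```

Because a sample-wide scaling factor cancels in the difference of logs, the relative value is less sensitive to technical variation between runs or platforms.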

ACS-CoR Signature Design

Junction Number  Gene  Window  Strand
J1  ANKRD22  chr10:90588423-90591591  -
J2  APOL1  chr22:36657768-36661196  +
J3  BATF2  chr11:64762021-64764347  -
J4  ETV7  chr6:36322464-36334651  -
J5  ETV7  chr6:36334539-36334651  -
J6  ETV7  chr6:36336848-36339106  -
J7  FCGR1A  chr1:149754330-149754725  +
J8  FCGR1A  chr1:149760173-149761609  +
J9  FCGR1B  chr1:120928615-120930038  -
J10  FCGR1B  chr1:120930293-120934380  -
J11  FCGR1B  chr1:120935468-120935863  -
J12  GBP1  chr1:89519151-89520364  -
J13  GBP1  chr1:89520558-89520795  -
J14  GBP1  chr1:89520898-89521698  -
J15  GBP1  chr1:89521911-89522536  -
J16  GBP1  chr1:89522817-89523674  -
J17  GBP1  chr1:89523917-89524523  -
J18  GBP1  chr1:89524726-89524999  -
J19  GBP1  chr1:89525109-89525879  -
J20  GBP1  chr1:89526007-89528727  -
J21  GBP1  chr1:89528936-89530842  -
J22  GBP2  chr1:89573974-89575359  -
J23  GBP2  chr1:89575553-89575846  -
J24  GBP2  chr1:89575949-89578142  -
J25  GBP2  chr1:89575949-89578154  -
J26  GBP2  chr1:89578367-89579698  -
J27  GBP2  chr1:89579979-89582674  -
J28  GBP2  chr1:89585971-89586825  -
J29  GBP2  chr1:89586953-89587459  -
J30  GBP4  chr1:89654477-89655720  -
J31  GBP5  chr1:89726500-89727902  -
J32  GBP5  chr1:89728468-89729418  -
J33  SCARF1  chr17:1540149-1540234  -
J34  SCARF1  chr17:1540356-1542099  -
J35  SCARF1  chr17:1542220-1542932  -
J36  SCARF1  chr17:1543036-1543205  -
J37  SCARF1  chr17:1543960-1546735  -
J38  SEPT4  chr17:56598521-56598614  -
J39  SERPING1  chr11:57365794-57367351  +
J40  SERPING1  chr11:57367850-57369507  +
J41  SERPING1  chr11:57369642-57373482  +
J42  SERPING1  chr11:57373686-57373880  +
J43  SERPING1  chr11:57374020-57379189  +
J44  SERPING1  chr11:57374020-57379300  +
J45  SERPING1  chr11:57379409-57381800  +
J46  STAT1  chr2:191840613-191841565  -
J47  STAT1  chr2:191841751-191843581  -
J48  STAT1  chr2:191843727-191844497  -
J49  STAT1  chr2:191844592-191845345  -
J50  STAT1  chr2:191845395-191847108  -
J51  STAT1  chr2:191847244-191848367  -
J52  STAT1  chr2:191848466-191849035  -
J53  STAT1  chr2:191849119-191850344  -
J54  STAT1  chr2:191850386-191851579  -
J55  STAT1  chr2:191851673-191851764  -
J56  STAT1  chr2:191851794-191854340  -
J57  STAT1  chr2:191854400-191855953  -
J58  STAT1  chr2:191856046-191859786  -
J59  STAT1  chr2:191864430-191865799  -
J60  STAT1  chr2:191872387-191873688  -
J61  TAP1  chr6:32818926-32819885  -
J62  TAP1  chr6:32820016-32820164  -
J63  TRAFD1  chr12:112587675-112589604  +

Splice junctions, genes and chromosomal locations that comprise the signature

51 Junctions
16 Genes

Visualizing the ACS-CoR signature

Single Progressor vs Control
Green lines indicate pair votes for ‘Control’, red for ‘Progressor’

All Training samples
Columns represent splice junctions
Rows represent mean expression +/- SD

Quantifying ACS-CoR performance

• Sensitivity: Progressors predicted to be progressors (TPR)
– TP / (TP + FN) = 6 / (6 + 9) = 40%
• Specificity: Controls predicted to be controls (TNR)
– TN / (TN + FP) = 70 / (70 + 7) = 91%

Example confusion matrix: ACS test set

                       True Progressors  True Controls
Predicted Progressors  6 (TP)            7 (FP)
Predicted Controls     9 (FN)            70 (TN)
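As a worked check, sensitivity and specificity can be recomputed from the four cells of the confusion matrix. The counts are the ones shown for the ACS test set; the function names are just for illustration.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Cells of the example confusion matrix.
tp, fp, fn, tn = 6, 7, 9, 70
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Note that specificity uses the false positives (controls predicted to be progressors) in its denominator, not the false negatives.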

• ACS-CoR provides a score, so a prediction threshold has to be chosen
• Tradeoff between sensitivity and specificity
– High threshold, high specificity, low sensitivity
– And vice-versa
• ROC curves plot sensitivity vs specificity over all threshold values
  • Shown: Cross-validation ACS-CoR scores on the training set
• Area under the curve represents model performance on ACS test set at all thresholds
– 1 = perfect classifier
– 0.5 = random guessing
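The threshold tradeoff and the area under the ROC curve can be sketched without plotting. Here the AUC is computed as the probability that a randomly chosen Progressor scores higher than a randomly chosen Control (ties count half), which equals the area under the ROC curve. The scores and labels are invented, and calling a sample ‘Progressor’ when its score is at or above the threshold is an assumed convention.

```python
def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold means 'Progressor'."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores: 1 = Progressor, 0 = Control.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
for threshold in (0.35, 0.5, 0.75):
    print(threshold, sens_spec_at(scores, labels, threshold))
print("AUC =", auc(scores, labels))
```

Raising the threshold from 0.35 to 0.75 trades sensitivity for specificity, while the AUC summarizes performance across all thresholds in one number.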

Translating ACS-CoR from RNAseq to qPCR

• Motivation:
– qPCR is cheap and accurate, suitable for a low-cost screening test

– Enabled validation on a new cohort with a locked-down signature without further expensive RNAseq

• ACS-CoR showed very similar performance on the independent test set for RNAseq and qPCR

Translating ACS-CoR from RNAseq to qPCR

• Splice junctions were translated directly to PCR primers (control and signature)

• All ACS cDNA was measured again via qPCR

• Individual SVM thresholds were rederived for each junction pair on the ACS training set

• GC6 validation set expression measured via PCR and ACS-CoR scores calculated

qPCR ACS-CoR validates on GC6 household contacts (HHCs)

• GC6-74 was a study of individuals after exposure to TB
• Like ACS, enrolled individuals were healthy, with a small number of Progressors

ACS-CoR was also translated to microarray

• ACS-CoR splice junction pairs were collapsed into gene pairs
• All array probes mapping to these genes were identified, all possible probe pairs were constructed, and new SVM boundaries were fit for each resulting pair
– Based on discriminating Berry et al, 2010 active TB vs latent TB samples

• Reparameterized array-based ACS-CoR used to classify existing microarray cohorts
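The collapse from junction pairs to gene pairs, and the expansion to all possible probe pairs, can be sketched as follows. The junction-to-gene mapping uses entries from the signature table, but which junctions are actually paired is not given here, so the pairs below are hypothetical, as are the probe identifiers. Refitting the SVM boundaries is not shown.

```python
from itertools import product


def gene_pairs_from_junction_pairs(junction_pairs, junction_to_gene):
    """Collapse junction pairs into a sorted list of unique gene pairs."""
    pairs = {(junction_to_gene[a], junction_to_gene[b])
             for a, b in junction_pairs}
    return sorted(pairs)


def probe_pairs_for(gene_pair, gene_to_probes):
    """All combinations of one probe per gene for a gene pair."""
    g1, g2 = gene_pair
    return list(product(gene_to_probes.get(g1, []), gene_to_probes.get(g2, [])))


junction_to_gene = {"J12": "GBP1", "J13": "GBP1", "J46": "STAT1"}
# Hypothetical junction pairs; both collapse to the same gene pair.
junction_pairs = [("J12", "J46"), ("J13", "J46")]
# Hypothetical probe identifiers for each gene.
gene_to_probes = {"GBP1": ["probe_A", "probe_B"], "STAT1": ["probe_C"]}

gene_pairs = gene_pairs_from_junction_pairs(junction_pairs, junction_to_gene)
probe_pairs = [pp for gp in gene_pairs for pp in probe_pairs_for(gp, gene_to_probes)]
print(gene_pairs)
print(probe_pairs)
```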

ACS-CoR was also translated to microarray

• ACS-CoR discriminates
  • Active TB vs latent TB
  • Active TB vs other inflammatory/pulmonary diseases
  • In 5 existing cohorts

ACS-CoR Summary

• ACS-CoR signature predicted TB risk

• Visualizable and interpretable

• Validated in independent test sets

• Translated to qPCR and microarray

• These didn’t happen by accident!

Conclusions

• There is a large amount of freely available TB host transcriptional data

• TB Host transcriptional responses are shared across species

• Use this data to support your own hypotheses!

Acknowledgements

ULD mouse infection
Courtney Plumlee
Kevin Urdahl

ACS & GC6 Signatures
Dan Zak
Ethan Thompson
ACS and GC6 study teams

Aitchison Lab
John Aitchison

RhCMV NHP data
Scott Hansen
Louis Picker
