Computational techniques for understanding host ...
Post on 28-Feb-2022
12 November 2019
Fergal Duffy
TB Bioinformatics Working Group
Computational techniques for understanding host transcriptional responses to TB
Talk Overview
• Host transcriptional responses to TB
1. Sources of open data
2. Pre-processing transcriptional profiles
3. Cross-study comparisons
• Case study: Development of ACS-CoR
4. Cross-species comparisons
• Case study: Mouse ULD signature
A typical workflow
1. Acquire data
– Yours or public data
– RNAseq or microarray
2. Preprocess
– Normalization
3. Downstream analyses
– Differential expression
– Cross-study comparisons
– Predictive models
Understanding host transcriptional responses
• Most common: systemic responses in blood
– Either whole blood or PBMCs
– Using microarrays or RNAseq or …
– 12,000 – 40,000 “measurements”
• Can be array probes/gene counts/splice junctions
Why blood?
• Local responses determine granuloma outcome
• But we can still learn from systemic responses
• Blood is
– Accessible
– Supported by many analysis tools
– Suitable for use in clinical diagnostics
Lin et al, 2014
Many blood transcriptional signatures to diagnose TB
Sweeney et al, 2016
• Table lists human whole-blood microarray study cohorts available in GEO as of 2016
• Total of ~2,500 transcriptional profiles including TB, LTB and related diseases and healthy individuals
• These have been the basis of many TB diagnostic signatures
• Meta-analysis led to a 3-gene signature
• GBP5, DUSP3, KLF2
Open TB transcriptional data on GEO
(1) Build your query: https://www.ncbi.nlm.nih.gov/gds/advanced
(2) Look for result Series
Open TB transcriptional data on SRA/BioProject
https://www.ncbi.nlm.nih.gov/bioproject/
Open TB transcriptional data on ArrayExpress
https://www.ebi.ac.uk/arrayexpress/
Search with Booleans (‘OR’, ‘AND’) and filter to find human array assays.
Microarray basics
• Labelled fluorescent transcripts hybridized to probes
• Chip scanned to image; continuous intensity values represent gene expression levels
Microarray normalization
• Background correction
• Log transform
– Results in approximately normally distributed expression values
• Quantile normalization between arrays
• R limma package
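A minimal Python sketch of the log transform and quantile normalization steps, for illustration only. It is not limma's implementation: background correction is skipped, ties are broken by position, and the intensity values are made up.

```python
import math

def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of values per array)."""
    n_probes = len(arrays[0])
    # Reference distribution: mean of the values at each rank across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays)
                 for i in range(n_probes)]
    normalized = []
    for a in arrays:
        # Order of probes within this array, from lowest to highest value.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference at its rank
        normalized.append(out)
    return normalized

# Made-up raw intensities for two arrays measuring the same four probes.
raw = [[120.0, 30.0, 800.0, 55.0], [200.0, 45.0, 1500.0, 60.0]]
logged = [[math.log2(v) for v in a] for a in raw]
norm = quantile_normalize(logged)
```

After normalization every array has the same distribution of values, while each probe keeps its rank within its own array.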
RNAseq basics
• Count number of reads mapping to areas of interest
– Different statistical approaches to count-based vs continuous measurements (e.g. microarrays)
• Also consider sequencing depth
RNAseq normalization
• Estimate mean and variance of counts and use to correct bias in low counts
– Negative binomial
– Poisson
• Correct for sequencing depth
– Counts per million
• Log transform
• R limma/DESeq packages
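A minimal Python sketch of the depth correction and log transform, assuming a simple log2 counts-per-million with a small offset to avoid log(0). The offset values and the counts are illustrative choices, and no mean-variance modelling is done here.

```python
import math

def log2_cpm(counts, offset=0.5):
    """Convert one sample's raw read counts to log2 counts per million."""
    library_size = sum(counts)  # sequencing depth of this sample
    return [math.log2((c + offset) / (library_size + 1.0) * 1e6)
            for c in counts]

# Two made-up samples with the same composition but 10x different depth.
cpm_a = log2_cpm([10, 90, 900])
cpm_b = log2_cpm([100, 900, 9000])
```

The two samples end up with nearly identical values, which is the point of correcting for depth before comparing samples.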
Differential Expression Analysis
• Does my gene of interest differ across two (or more) conditions in one study?
– Same approach for microarrays/RNAseq
– t-test, linear models, ANOVA
– Correct for multiple testing: false discovery rate approach
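A minimal Python sketch of one false discovery rate correction, the Benjamini-Hochberg procedure. The slide does not name a specific method, so this choice is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up per-gene p-values, e.g. from one t-test per gene.
p = [0.001, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)
```

Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would be called differentially expressed.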
Challenges for signature development
• How do we design a statistically interpretable, robust and universal signature?
– n (samples) << p (genes)
• How do we deal with technical variation/batch effects between studies?
• How do we translate to new sequencing technologies?
Case Study: ACS-CoR
• Cohort: Latent TB+ South African Adolescents
– Recruited and prospectively followed for 2 years
– 46 individuals developed active TB (Progressors) matched to 107 Control individuals
– Whole blood RNAseq at enrollment
• optionally at 180 days, 360 days, 540 days post enrollment
• Goal: Develop a transcriptional signature of TB risk
– Look for transcriptional responses specific to Progressors prior to disease onset
Zak et al, 2016
ACS-CoR signature discovery workflow
• Derived & cross-validated a TB risk signature (RNAseq) on LTB+ adolescents
– ACS 3:1 training:test split
• Translated to qPCR and validated on a new cohort of TB household contacts
ACS-CoR: Training a Pairwise SVM
• ACS-CoR is based on pairs of splice junctions
1. Pick differentially expressed splice junctions
2. Make junction pairs
3. Fit linear decision boundary (Support vector machine: SVM)
4. Select best pairs
5. All pairs ‘vote’ to classify new samples
Advantages
• Interpretable
• Mathematically straightforward
• Visualizable
• Voting is robust to missing genes
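A minimal Python sketch of the voting step only, assuming each junction pair already has a fitted linear boundary. The junction names, coefficients and the "fraction of votes" score are hypothetical illustrations, not the published ACS-CoR parameters.

```python
def pair_vote_score(sample, pair_models):
    """Fraction of junction pairs voting 'Progressor' for one sample.

    sample: dict mapping junction name -> normalized expression.
    pair_models: list of (junction_a, junction_b, w_a, w_b, intercept),
                 each a linear boundary assumed to be fit beforehand.
    """
    votes = []
    for j_a, j_b, w_a, w_b, intercept in pair_models:
        if j_a not in sample or j_b not in sample:
            continue  # a pair with a missing junction abstains
        decision = w_a * sample[j_a] + w_b * sample[j_b] + intercept
        votes.append(1 if decision > 0 else 0)
    if not votes:
        raise ValueError("no junction pair could vote for this sample")
    return sum(votes) / len(votes)

# Hypothetical junction names and coefficients, for illustration only.
models = [("jA", "jB", 1.0, 1.0, -1.0),
          ("jA", "jC", 0.5, 2.0, -2.0),
          ("jB", "jC", 1.0, -1.0, 0.0)]
sample = {"jA": 1.2, "jB": 0.4, "jC": 0.1}
score = pair_vote_score(sample, models)
score_missing = pair_vote_score({"jA": 1.2, "jB": 0.4}, models)
```

Because pairs that cannot be evaluated simply abstain, a sample with a missing junction still gets a score from the remaining pairs.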
RNAseq enables splice junction counting
Exons
Lines represent junction-spanning reads
ACS-CoR: Training a Pairwise SVM
• 5X Cross-validation:
– Randomly split your data into 5
– Train on 4x, predict on 1x
– Unbiased estimate of prediction performance
• Internal cross-validation was also used to select robust gene pairs
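A minimal Python sketch of a 5-fold split, assuming samples are identified by index. The function name and seed are illustrative, and model fitting is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random but reproducible split
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(20, k=5))
```

Each sample is predicted exactly once, by a model that never saw it during training.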
ACS-CoR: Housekeeping normalization

Junction                     Gene
chr2:114713283-114714936     ACTR3
chr11:67050699-67051177      ADRBK1
chr11:67051844-67052317      ADRBK1
chr1:22413359-22417920       CDC42
chr1:115261366-115262199     CSDE1
chr1:115262363-115263159     CSDE1
chr1:115260837-115261233     CSDE1
chr2:158272655-158275034     CYTIP
chr5:176859807-176860147     GRK6
chr5:176778292-176778452     LMAN2
chr5:176764786-176765488     LMAN2
chr5:176765606-176778173     LMAN2
chr12:50149538-50152009      TMBIM6
chr12:50152263-50152465      TMBIM6
chr12:50152545-50153003      TMBIM6
chr12:50152058-50152165      TMBIM6
chr12:50153104-50155486      TMBIM6
chr1:154130197-154142875     TPM3
chr1:154142945-154143124     TPM3
chr19:35761500-35761620      USF2

Reference junctions used for normalization of the signature
Expression relative to a panel of constitutively expressed housekeeping junctions is more robust to technical variability and facilitates translation to PCR
Expression relative to mean of reference junctions
Expression = log(Junction_normCPM) – log(mean(refJunction_normCPM))
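A minimal Python sketch of the formula above. The slide does not state the log base, so base 2 is an assumption here, and the CPM values are made up.

```python
import math

def relative_expression(junction_cpm, reference_cpms):
    """Log expression of one junction relative to the mean reference junction."""
    reference_mean = sum(reference_cpms) / len(reference_cpms)
    return math.log2(junction_cpm) - math.log2(reference_mean)

# Made-up normalized CPM values: one signature junction, two reference junctions.
expr = relative_expression(8.0, [2.0, 6.0])
```

Here the junction is at 8 CPM against a reference mean of 4 CPM, so the relative expression is one log2 unit.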
ACS-CoR Signature Design

Junction Number  Gene      Window                     Strand
J1               ANKRD22   chr10:90588423-90591591    -
J2               APOL1     chr22:36657768-36661196    +
J3               BATF2     chr11:64762021-64764347    -
J4               ETV7      chr6:36322464-36334651     -
J5               ETV7      chr6:36334539-36334651     -
J6               ETV7      chr6:36336848-36339106     -
J7               FCGR1A    chr1:149754330-149754725   +
J8               FCGR1A    chr1:149760173-149761609   +
J9               FCGR1B    chr1:120928615-120930038   -
J10              FCGR1B    chr1:120930293-120934380   -
J11              FCGR1B    chr1:120935468-120935863   -
J12              GBP1      chr1:89519151-89520364     -
J13              GBP1      chr1:89520558-89520795     -
J14              GBP1      chr1:89520898-89521698     -
J15              GBP1      chr1:89521911-89522536     -
J16              GBP1      chr1:89522817-89523674     -
J17              GBP1      chr1:89523917-89524523     -
J18              GBP1      chr1:89524726-89524999     -
J19              GBP1      chr1:89525109-89525879     -
J20              GBP1      chr1:89526007-89528727     -
J21              GBP1      chr1:89528936-89530842     -
J22              GBP2      chr1:89573974-89575359     -
J23              GBP2      chr1:89575553-89575846     -
J24              GBP2      chr1:89575949-89578142     -
J25              GBP2      chr1:89575949-89578154     -
J26              GBP2      chr1:89578367-89579698     -
J27              GBP2      chr1:89579979-89582674     -
J28              GBP2      chr1:89585971-89586825     -
J29              GBP2      chr1:89586953-89587459     -
J30              GBP4      chr1:89654477-89655720     -
J31              GBP5      chr1:89726500-89727902     -
J32              GBP5      chr1:89728468-89729418     -
J33              SCARF1    chr17:1540149-1540234      -
J34              SCARF1    chr17:1540356-1542099      -
J35              SCARF1    chr17:1542220-1542932      -
J36              SCARF1    chr17:1543036-1543205      -
J37              SCARF1    chr17:1543960-1546735      -
J38              SEPT4     chr17:56598521-56598614    -
J39              SERPING1  chr11:57365794-57367351    +
J40              SERPING1  chr11:57367850-57369507    +
J41              SERPING1  chr11:57369642-57373482    +
J42              SERPING1  chr11:57373686-57373880    +
J43              SERPING1  chr11:57374020-57379189    +
J44              SERPING1  chr11:57374020-57379300    +
J45              SERPING1  chr11:57379409-57381800    +
J46              STAT1     chr2:191840613-191841565   -
J47              STAT1     chr2:191841751-191843581   -
J48              STAT1     chr2:191843727-191844497   -
J49              STAT1     chr2:191844592-191845345   -
J50              STAT1     chr2:191845395-191847108   -
J51              STAT1     chr2:191847244-191848367   -
J52              STAT1     chr2:191848466-191849035   -
J53              STAT1     chr2:191849119-191850344   -
J54              STAT1     chr2:191850386-191851579   -
J55              STAT1     chr2:191851673-191851764   -
J56              STAT1     chr2:191851794-191854340   -
J57              STAT1     chr2:191854400-191855953   -
J58              STAT1     chr2:191856046-191859786   -
J59              STAT1     chr2:191864430-191865799   -
J60              STAT1     chr2:191872387-191873688   -
J61              TAP1      chr6:32818926-32819885     -
J62              TAP1      chr6:32820016-32820164     -
J63              TRAFD1    chr12:112587675-112589604  +

Splice junctions, genes and chromosomal locations that comprise the signature
51 Junctions
16 Genes
Visualizing the ACS-CoR signature
Single Progressor vs Control
Green lines indicate pair votes for ‘Control’, red for ‘Progressor’
All Training samples
Columns represent splice junctions
Rows represent mean expression +/- SD
Quantifying ACS-CoR performance
• Sensitivity: Progressors predicted to be progressors (TPR)
– TP / (TP + FN) = 6 / (6 + 9) = 40%
• Specificity: Controls predicted to be controls (TNR)
– TN / (TN + FP) = 70 / (70 + 7) = 91%
                        True Progressors   True Controls
Predicted Progressors   6 (TP)             7 (FP)
Predicted Controls      9 (FN)             70 (TN)
Example Confusion matrix: ACS test set
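A minimal Python sketch that recomputes sensitivity and specificity from the four counts in the confusion matrix above. The variable names are illustrative.

```python
# Counts from the example confusion matrix (ACS test set).
tp, fp = 6, 7    # predicted Progressors: true Progressors, true Controls
fn, tn = 9, 70   # predicted Controls: true Progressors, true Controls

sensitivity = tp / (tp + fn)  # share of true Progressors that were detected
specificity = tn / (tn + fp)  # share of true Controls correctly called Controls

print(f"Sensitivity: {sensitivity:.0%}, Specificity: {specificity:.0%}")
```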
Quantifying ACS-CoR performance
• ACS-CoR provides a score, so we have to pick a prediction threshold
• Tradeoff between sensitivity and specificity
– High threshold, high specificity, low sensitivity
– And vice-versa
Quantifying ACS-CoR performance
• ROC curves plot sensitivity vs specificity over all threshold values
• Shown: Cross validation ACS-CoR scores on the training set
• Area under the curve represents model performance on ACS test set at all thresholds
– 1 = perfect classifier
– 0.5 = random guessing
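A minimal Python sketch of the area under the ROC curve, computed as the probability that a randomly chosen Progressor scores higher than a randomly chosen Control (ties count half). The scores and labels are made up.

```python
def roc_auc(scores, labels):
    """AUC from scores and 0/1 labels (1 = Progressor, 0 = Control)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Compare every Progressor score with every Control score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up signature scores for three Progressors and two Controls.
auc = roc_auc([0.9, 0.8, 0.3, 0.6, 0.2], [1, 1, 1, 0, 0])
```

This pairwise form gives the same value as integrating the ROC curve over all thresholds, so no single threshold has to be chosen to report it.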
Translating ACS-CoR from RNAseq to qPCR
• Motivation:
– qPCR is cheap and accurate, suitable for a low-cost screening test
– Enabled validation on a new cohort with a locked-down signature without further expensive RNAseq
Quantifying ACS-CoR performance
• ACS-CoR showed very similar performance on the independent test set for RNAseq and qPCR
Translating ACS-CoR from RNAseq to qPCR
• Splice junctions were translated directly to PCR primers (control and signature)
• All ACS cDNA was measured again via qPCR
• Individual SVM thresholds were rederived for each junction pair on the ACS training set
• GC6 validation set expression measured via PCR and ACS-CoR scores calculated
qPCR ACS-CoR validates on GC6 HHCs
• GC6-74 was a study of individuals after exposure to TB
• Like ACS, individuals enrolled were healthy, with a small number of Progressors
ACS-CoR was also translated to microarray
• ACS-CoR splice junction pairs were collapsed into gene pairs
• All array probes mapping to these genes were identified, all possible probe pairs were constructed, and new SVM boundaries were fit for each resulting pair
– Based on discriminating Berry et al, 2010 active TB vs latent TB samples
• Reparameterized array-based ACS-CoR used to classify existing microarray cohorts
ACS-CoR was also translated to microarray
• ACS-CoR discriminates
– Active TB vs latent TB
– Active TB vs other inflammatory/pulmonary diseases
– In 5 existing cohorts
ACS-CoR Summary
• ACS-CoR signature predicted TB risk
• Visualizable and interpretable
• Validated in independent test sets
• Translated to qPCR and microarray
• These didn’t happen by accident!
Conclusions
• There is a large amount of freely available TB host transcriptional data
• TB host transcriptional responses are shared across species
• Use this data to support your own hypotheses!
Acknowledgements
ULD mouse infection: Courtney Plumlee, Kevin Urdahl
ACS & GC6 Signatures: Dan Zak, Ethan Thompson, ACS and GC6 study teams
Aitchison Lab: John Aitchison
RhCMV NHP data: Scott Hansen, Louis Picker