Classification and Clustering for Hit Identification in High Content RNAi Screens
Post on 10-May-2015
Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics
January 11, 2012
DNA Re-replication
Sivaprasad et al Cell Division
DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 help prevent DNA re-replication.
Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.
After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.
Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U
DNA Re-replication
Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.
Zhu et al, Cancer Res, 2009
Screening Protocol
• HCT-116 colon cancer cells are fixed and stained (Hoechst)
• Imaged at 4X on ImageXpress
• MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
• Screens were run with singles and pools
Screen Summary
• Qiagen druggable genome library (6,866 genes)
• 94 plates, 36K wells including controls
• Good screen performance, some poorer plates were redone
[Figure: per-plate screen quality statistics, trimmed Z' and SSMD, plotted against plate index]
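The two plate-quality statistics named in the figure can be sketched as below. This is a minimal illustration in Python, not the authors' code: the control readouts are made up, and the 10% trimming fraction is an assumption, since the slides do not say how the trimmed Z' was computed.

```python
import statistics

def trim(values, fraction=0.1):
    """Drop the lowest and highest `fraction` of values (assumed trimming scheme)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return 1 - 3 * spread / abs(statistics.mean(pos) - statistics.mean(neg))

def ssmd(pos, neg):
    """SSMD = (mean_pos - mean_neg) / sqrt(var_pos + var_neg)."""
    diff = statistics.mean(pos) - statistics.mean(neg)
    return diff / (statistics.variance(pos) + statistics.variance(neg)) ** 0.5

# Hypothetical readouts for the control wells of one plate.
positive_controls = [40.0, 42.0, 41.0, 39.0, 43.0, 41.5, 40.5, 42.5, 38.0, 60.0]
negative_controls = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 11.2, 8.9, 10.1]

plain = z_prime(positive_controls, negative_controls)
trimmed = z_prime(trim(positive_controls), trim(negative_controls))
plate_ssmd = ssmd(positive_controls, negative_controls)
```

With a single outlying positive control well, the trimmed Z' comes out higher than the plain Z', which is the usual motivation for a trimmed variant.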
Goals
• Can we identify genes with GMNN-like phenotypes?
– We already identified a set of genes via thresholding the %G2 parameter
– We’d like to see what we get when we use a multi-dimensional representation
• Employ predictive modeling to “learn” the phenotype
• Apply clustering and identify biologically relevant clusters
What Do GMNN Wells Look Like?
Cell-Level Modeling
• A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
– Expected that the distribution for GMNN wells should match the positive control
– Use KS test to identify wells with similar distributions
– Doesn’t work too well, even for GMNN itself
– Considers 1 parameter at a time (though a 2D KS test is possible)
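The distribution-matching idea can be illustrated with a small two-sample Kolmogorov-Smirnov statistic. This is a sketch under stated assumptions: the per-cell values are invented, only the statistic is computed (no p-value), and a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical values of one per-cell parameter.
positive_control = [4.2, 4.5, 4.8, 5.1, 5.3, 5.6]
similar_well = [4.3, 4.6, 4.7, 5.0, 5.4, 5.5]
different_well = [2.0, 2.1, 2.2, 2.4, 4.4, 4.9]

d_similar = ks_statistic(positive_control, similar_well)
d_different = ks_statistic(positive_control, different_well)
```

A small statistic means the well's distribution resembles the positive control. The statistic looks at one parameter at a time, which is the limitation noted on the slide.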
Random Forest Model
• Ensemble of decision trees (Breiman 2001)
• Not always the most accurate, but great for exploratory modeling
– Implicit feature selection
– Resistant to overfitting as more trees are added
– Provides a measure of feature importance
• Used the randomForest package from R
http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html
Cell-Level Modeling
• Removed cells with “incomplete” parameters
• Still leaves 291K positive cases and 3M negative cases
• Developed a random forest model, sampling from negatives to maintain balanced classes
– Predict whether a cell is GMNN-like
– Models from multiple samples of the negative control exhibited similar performance
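The class-balancing step can be sketched as follows. This is an illustrative Python stand-in, not the authors' R code: the function name, the seeds, and the toy cell identifiers are all hypothetical.

```python
import random

def balanced_sample(positives, negatives, seed=0):
    """Draw equally sized random subsets from both classes (size of the smaller class)."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

# Toy stand-ins for per-cell records; the real screen had far more cells.
positive_cells = [f"pos_{i}" for i in range(30)]
negative_cells = [f"neg_{i}" for i in range(300)]

# Several independent draws of the negatives, one training set per seed.
training_sets = [balanced_sample(positive_cells, negative_cells, seed=s)
                 for s in range(3)]
```

Fitting one model per draw and comparing their error rates is one way to check that performance does not depend on which negatives were sampled.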
Confusion matrix (rows: actual class, columns: predicted class)
Class      Positive   Negative
Positive   220,636    72,498
Negative   35,614     257,520
Overall 18% error, 25% error on positive class and 12% error on negative class
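The quoted error rates follow directly from the four counts in the confusion matrix. The sketch below recomputes them, treating each row as the actual class, which is the reading consistent with the stated per-class errors.

```python
# Counts copied from the slide; rows are taken to be the actual class.
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(matrix, actual):
    """Fraction of cells of class `actual` that were assigned to the other class."""
    row = matrix[actual]
    wrong = sum(count for predicted, count in row.items() if predicted != actual)
    return wrong / sum(row.values())

def overall_error(matrix):
    """Fraction of all cells that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(count for actual, row in matrix.items()
                for predicted, count in row.items() if predicted != actual)
    return wrong / total

positive_error = class_error(confusion, "positive")
negative_error = class_error(confusion, "negative")
total_error = overall_error(confusion)
```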
Cell-Level Modeling
• Significant overlap between distributions for the negative and positive controls
Cell-Level Predictions
• Aggregate predictions for all cells in a well to label a well as GMNN-like
• Identify genes with >= 2 siRNAs (i.e. wells) labeled as GMNN-like
– 31 genes identified (GMNN, KIF11, ESPL1, …)
• Identified expected genes, and most of the set was functionally relevant
– Also identified a few interesting, novel genes
• Reconfirmation based on Ambion sequences was relatively low (9/31)
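The aggregation from cells to wells to genes might look like the sketch below. The slides do not say how per-cell predictions were combined, so the majority rule (`min_fraction=0.5`) is an assumption, and the well and gene names are placeholders.

```python
from collections import defaultdict

def gmnn_like_wells(cell_calls, min_fraction=0.5):
    """cell_calls maps well id -> list of per-cell booleans (True = GMNN-like)."""
    return {well for well, calls in cell_calls.items()
            if calls and sum(calls) / len(calls) > min_fraction}

def genes_with_support(well_to_gene, flagged_wells, min_sirnas=2):
    """Genes with at least `min_sirnas` flagged wells (one siRNA per well)."""
    counts = defaultdict(int)
    for well in flagged_wells:
        counts[well_to_gene[well]] += 1
    return sorted(gene for gene, n in counts.items() if n >= min_sirnas)

# Hypothetical per-cell predictions for four wells targeting two genes.
cell_calls = {
    "A01": [True, True, False, True],
    "A02": [True, True, True, False],
    "B01": [False, False, True, False],
    "B02": [True, True, False, False],
}
well_to_gene = {"A01": "GENE_X", "A02": "GENE_X", "B01": "GENE_Y", "B02": "GENE_Y"}

flagged = gmnn_like_wells(cell_calls)
hits = genes_with_support(well_to_gene, flagged)
```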
Well-Level Modeling
• Started with 27 parameters from MetaXpress
• Performed automated feature selection
– Remove undefined, constant features
– Manually removed a few highly correlated features
• Work with 12 parameters
• Convert to Z-scores
• Positive & negative controls are nicely separated
[Figure panels: All Wells; Control Wells]
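The preprocessing steps on this slide can be sketched in a few lines. Several things here are assumptions: the feature names and values are invented, the correlation cutoff of 0.95 is arbitrary, and the correlated-feature filter is automated even though the slide says that step was done manually.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def drop_constant(features):
    """Remove features that never vary, since their Z-score is undefined."""
    return {name: vals for name, vals in features.items() if len(set(vals)) > 1}

def pearson(x, y):
    """Pearson correlation computed from the Z-scores of x and y."""
    return sum(a * b for a, b in zip(z_scores(x), z_scores(y))) / (len(x) - 1)

def drop_correlated(features, limit=0.95):
    """Greedy filter: keep a feature only if |r| <= limit with all kept ones."""
    kept = {}
    for name, vals in features.items():
        if all(abs(pearson(vals, other)) <= limit for other in kept.values()):
            kept[name] = vals
    return kept

# Hypothetical per-well values for four made-up parameters.
raw = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.1, 4.0, 6.2, 8.1, 9.9],   # nearly a multiple of feature_a
    "feature_c": [7.0, 7.0, 7.0, 7.0, 7.0],   # constant
    "feature_d": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(drop_constant(raw))
scaled = {name: z_scores(vals) for name, vals in selected.items()}
```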
Parameter Distributions
Model Performance
• Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
• Perfect classification!
– Worrying: overfitting?
– Nearly 99% of the control wells were confidently classified as positive or negative
Class      Positive   Negative
Positive   1504       0
Negative   0          1504
Descriptor Importance
• What does the model identify as the most relevant descriptors?
• Some parameters are moderately correlated
[Figure: random forest descriptor importance, MeanDecreaseGini (axis 0 to 300), for the 12 descriptors as listed in the plot: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, X.G2, SPhaseCells, G0.G1Cells]
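MeanDecreaseGini is built from the drop in Gini impurity produced by each split on a descriptor, accumulated over a tree and averaged over the forest. The sketch below shows only the core quantity for a single split. The values, labels, and threshold are invented, and the exact weighting used inside the randomForest package is not reproduced here.

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of the children."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical wells: 0 = negative control, 1 = positive control.
labels = [0, 0, 1, 1]
informative = gini_decrease([1.0, 2.0, 3.0, 4.0], labels, threshold=2.5)
uninformative = gini_decrease([1.0, 3.0, 2.0, 4.0], labels, threshold=2.5)
```

A descriptor whose splits separate the classes cleanly accumulates a large decrease, while one whose splits leave the children as mixed as the parent contributes nothing.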
Random Forest Predictions
• We use the model to predict the class for all the remaining wells
• All four siRNAs targeting GMNN are classified as Geminin-like with high confidence
[Figure: histogram of the predicted probability of being Geminin-like (x-axis, 0.0 to 1.0) against percent of total (y-axis)]
Random Forest Predictions
• Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8
• Good overlap with cell-level model
[Figure: probability of being Geminin-like (y-axis, 0.0 to 1.0) for the selected genes: AURKA, AURKB, BRD8, C8orf79, CDCA5, CDCA8, CRAT, ESPL1, F12, FBXO5, GMNN, GUSB, INCENP, ITPKA, JUN, KCNH6, KIF11, MLL4, OR10A2, PLK1, PSMA1, PSMB4, ROBO2, RPLP2, SNRK, TOP2A, TRIM64, TTK, UBC, WRN]
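The gene selection rule on this slide (more than 75% of a gene's siRNAs predicted Geminin-like with probability above 0.8) can be written down directly. The probabilities below are invented for illustration, and `GENE_A` and `GENE_B` are placeholder names.

```python
def select_genes(sirna_probs, min_prob=0.8, min_fraction=0.75):
    """Keep genes where more than `min_fraction` of siRNAs exceed `min_prob`."""
    selected = []
    for gene, probs in sirna_probs.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)

# Hypothetical predicted probabilities, one value per siRNA.
example = {
    "GMNN": [0.97, 0.95, 0.99, 0.92],
    "GENE_A": [0.90, 0.85, 0.81, 0.30],   # exactly 75% confident, so not selected
    "GENE_B": [0.10, 0.20, 0.05, 0.15],
}
chosen = select_genes(example)
```

Because both thresholds are strict inequalities, a gene with four siRNAs needs all four above 0.8 to be selected.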
GO Enrichment
• GO Biological Processes enriched in this set of selected genes are relevant to the biology
• Similarly with pathways (from GeneGo)
Clustering
• RF classification is useful, but doesn’t directly tell us much about finer groups of genes that might be phenotypically related
• So we apply unsupervised clustering (PAM)
– Explore different numbers of clusters
– Evaluate statistical cluster quality metrics
– Evaluate biologically motivated quality metrics
• We considered both plate-wise and experiment-wise clustering protocols
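PAM (partitioning around medoids) picks k actual data points as cluster centres and swaps them with other points while the total distance to the nearest medoid keeps falling. The sketch below is a simplified version for illustration only: it starts from the first k points instead of PAM's BUILD step, uses Euclidean distance as a placeholder metric, and runs on made-up 2D points. It is not the R `pam` routine.

```python
from itertools import product

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(euclidean(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Greedy swap search over medoid indices (simplified PAM, naive start)."""
    medoids = list(range(k))  # naive initialisation; real PAM uses a BUILD step
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # accept any swap that lowers the total cost
                medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: euclidean(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Hypothetical well profiles reduced to two dimensions.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(wells, k=2)
```

Since medoids are real wells, each cluster has a representative well that can be inspected directly.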
Platewise Clustering (k=4)
• Cluster assignments can’t be directly compared across plates
• Good to see that control columns are distinctly clustered
• Certain plates show no membership to the ‘GMNN cluster’
Experimentwise Clustering (k=2)
• Encouraging to see clean separation between control columns
• Bulk of wells are identified as inactive
• We can compare results from this clustering to RF classification
– 6 genes identified, with multiple siRNAs clustered with negative control
Experimentwise Clustering (k=2)
• 6 genes identified with multiple siRNAs clustered with the negative control
• These were confidently identified by the RF model
[Figure: probability of being Geminin-like (y-axis, 0.0 to 1.0) per gene for the genes selected by the random forest model]
How Many Clusters?
• A priori, difficult to decide how many clusters there should be
– Manual spot checks did not identify distinctly different morphologies, counts
• Evaluate clusters with varying k and calculate average silhouette width
• Clustering based on the Euclidean metric doesn’t do a good job
[Figure: average silhouette width (y-axis, 0.2 to 0.7) versus number of clusters (x-axis, 2 to 20)]
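The average silhouette width compares, for each point, its mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below is a minimal version on invented points. It assumes at least two clusters and distinct points, and it uses Euclidean distance only as a placeholder.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(point, q) for q in other) / len(other)
                for name, other in clusters.items() if name != label)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical points with one sensible and one poor cluster assignment.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = average_silhouette(wells, [0, 0, 0, 1, 1, 1])
poor = average_silhouette(wells, [0, 1, 0, 1, 0, 1])
```

Repeating this for each candidate k and plotting the result gives a curve like the one in the figure. Values near 1 indicate compact, well-separated clusters.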
How Many Clusters?
• One approach is to ignore clusterings that spread the GMNN siRNAs across multiple clusters
• The current data suggest that we stick with k = 5
Biological Enrichment in Clusters
• Considering 5 clusters
• Some clusters are annotated with more relevant terms
Cluster containing ¾ GMNN siRNAs
Signal Enhancement in Clusters
• Signal is significantly enhanced in some clusters versus others
• Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3
Making a Final Hitlist
• Off-target effects (OTE) are a major confounding factor
• We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis
• Select genes from individual clusters, using % G2 and number of siRNAs as secondary filters
• Combine with hits from random forest model
Marine, S. et al, J. Biomol. Screen., 2011, ASAP
Reconfirmation
• 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences
• Considering just the genes selected by the random forest and/or clustering methods
– 11/30 genes selected by RF reconfirmed using Ambion libraries
– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
• ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
• Based on k = 5 clustering,
– 23/181 genes from cluster 3 reconfirmed
– 5/5 genes from cluster 5 reconfirmed
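To compare the selection methods on one scale, the counts on this slide can be turned into reconfirmation rates. The numbers are copied from the slide and the labels are shorthand introduced here.

```python
# (reconfirmed, selected) pairs as reported on the slide.
reconfirmed = {
    "threshold on primary screen": (18, 211),
    "random forest": (11, 30),
    "random forest and clustering": (5, 6),
    "cluster 3 (k = 5)": (23, 181),
    "cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmed.items()}

# Print the methods from lowest to highest reconfirmation rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.0%}")
```

The small denominators for the combined selection (6 genes) and cluster 5 (5 genes) mean those rates carry wide uncertainty.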
Outlook
• Complements traditional threshold-based selection methods
• The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
• Combined with clustering, it lets us zoom in on biologically relevant clusters of genes
Acknowledgements
• Scott Martin
• Pinar Tuzmen
• Carleen Klump
• Eugen Buehler