Classification and Clustering for Hit Identification in High Content RNAi Screens


Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics

January 11, 2012

DNA Re-replication

Sivaprasad et al., Cell Division

DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.

Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.

After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.

Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U

DNA Re-replication

Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.

Zhu et al, Cancer Res, 2009

Screening Protocol

•  HCT-116 colon cancer cells are fixed and stained (Hoechst)

•  Image at 4X on ImageXpress

•  MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content

•  Screens were run with singles and pools

Screen Summary

•  Qiagen druggable genome library (6,866 genes)

•  94 plates, 36K wells including controls

•  Good screen performance, some poorer plates were redone
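Screen performance per plate is summarized here with a trimmed Z'. A minimal sketch of how such a statistic could be computed is given below. The Z' formula itself is the standard one (1 minus three times the summed control standard deviations over the separation of the control means); the trimming rule, the trim fraction, and the control readouts are assumptions made for illustration, not the values used in the screen.

```python
from statistics import mean, stdev

def trim(values, fraction):
    """Drop the lowest and highest `fraction` of the values (assumed trimming rule)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k > 0 else ordered

def z_prime(positive, negative, trim_fraction=0.0):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    pos = trim(positive, trim_fraction)
    neg = trim(negative, trim_fraction)
    separation = abs(mean(pos) - mean(neg))
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / separation

# Hypothetical control-well readouts for one plate (not data from the screen).
pos_wells = [48.0, 52.0, 50.0, 51.0, 49.0, 90.0]  # includes one outlier well
neg_wells = [10.0, 12.0, 11.0, 9.0, 8.0, 10.0]

plain = z_prime(pos_wells, neg_wells)
trimmed = z_prime(pos_wells, neg_wells, trim_fraction=0.2)
```

In this toy plate a single outlier control well drags the untrimmed Z' below zero, while the trimmed value stays high, which is the usual motivation for trimming before judging plate quality.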

[Figure: screen quality statistic by plate index (0 to 100); panel: Trimmed Z']

•  Can we identify genes with GMNN-like phenotypes?
– We already identified a set of genes via thresholding the %G2 parameter
– We'd like to see what we get when we use a multi-dimensional representation

•  Employ predictive modeling to "learn" the phenotype

•  Apply clustering and identify biologically relevant clusters

What Do GMNN Wells Look Like?

Cell-Level Modeling

•  A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
– Expected that the distribution for GMNN wells should match the positive control
– Use KS test to identify wells with similar distributions
– Doesn't work too well, even for GMNN itself
– Considers 1 parameter at a time (though a 2D KS test is possible)
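The distribution-matching idea can be sketched with a two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of one parameter in a well and in the pooled positive control. The function name and the toy values are illustrative assumptions, and the p-value calculation is left out.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        # bisect_right counts the values <= x, which gives the empirical CDF at x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-cell values of a single parameter.
control_cells = [1.0, 2.0, 3.0, 4.0]
well_cells = [3.0, 4.0, 5.0, 6.0]
distance = ks_statistic(well_cells, control_cells)
```

A statistic near 0 means the well resembles the control for that one parameter, and a value near 1 means the two distributions barely overlap. Because each call looks at a single parameter, a well has to be compared parameter by parameter, which is the limitation noted above.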

Random Forest Model

•  Ensemble of decision trees (Breiman 1984)

•  Not always the most accurate, but great for exploratory modeling
– Implicit feature selection
– Proven to not overfit
– Provides a measure of feature importance

•  Employ the randomForest package from R

http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html

•  Removed cells with "incomplete" parameters

•  Still leaves 291K positive cases and 3M negative cases

•  Developed a random forest model, sampling from negatives to maintain balanced classes
– Predict whether a cell is GMNN-like
– Models from multiple samples of the negative control exhibited similar performance
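Balancing the classes by sampling the negatives can be sketched as follows. The function name, the seed handling, and the toy sizes are assumptions; the model-fitting step itself (done with randomForest in R) is not reproduced here.

```python
import random

def balanced_sample(positives, negatives, seed=0):
    """Draw as many negatives (without replacement) as there are positives."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

# Toy stand-ins for cell records, keeping roughly the 1:10 imbalance.
positive_cells = list(range(291))
negative_cells = list(range(1000, 4000))

# Repeating the draw with different seeds gives several balanced training
# sets, so models built on different negative samples can be compared.
training_sets = [balanced_sample(positive_cells, negative_cells, seed=s) for s in range(3)]
```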

Confusion matrix (cell-level model):

            Positive    Negative
Positive     220,636      72,498
Negative      35,614     257,520

Overall 18% error, 25% error on positive class and 12% error on negative class
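The stated error rates follow directly from the four counts. The sketch below reads each row as the observed class and each column as the predicted class, which is the reading that reproduces the 25% and 12% class errors; the dictionary layout and function names are illustrative.

```python
# Rows: observed class; columns: predicted class (assumed orientation).
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(matrix, label):
    """Fraction of cells of class `label` that were predicted as something else."""
    row = matrix[label]
    total = sum(row.values())
    return (total - row[label]) / total

def overall_error(matrix):
    """Fraction of all cells that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return (total - correct) / total

positive_error = class_error(confusion, "positive")
negative_error = class_error(confusion, "negative")
total_error = overall_error(confusion)
```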

•  Significant overlap between distributions for the negative and positive controls

Cell-Level Predictions

•  Aggregate predictions for all cells in a well to label a well as GMNN-like

•  Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
– 31 genes identified (GMNN, KIF11, ESPL1, …)

•  Identified expected genes and most of the set were functionally relevant
– Also identified a few interesting, novel genes

•  Reconfirmation based on Ambion sequences was relatively low (9/31)
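The roll-up from cells to wells to genes can be sketched as below. The slides do not state how cell predictions were aggregated, so the rule used here (a well is GMNN-like when more than half of its cells are predicted GMNN-like) is purely an assumption, as are the function names and the toy identifiers.

```python
from collections import defaultdict

def gmnn_like_wells(cell_calls, min_fraction=0.5):
    """cell_calls: iterable of (well_id, is_gmnn_like), one entry per cell."""
    counts = defaultdict(lambda: [0, 0])  # well -> [GMNN-like cells, all cells]
    for well, is_like in cell_calls:
        counts[well][0] += int(is_like)
        counts[well][1] += 1
    return {w for w, (pos, total) in counts.items() if pos / total > min_fraction}

def genes_with_multiple_wells(wells, well_to_gene, min_wells=2):
    """Keep genes for which at least `min_wells` siRNA wells were labeled."""
    per_gene = defaultdict(int)
    for well in wells:
        per_gene[well_to_gene[well]] += 1
    return {gene for gene, n in per_gene.items() if n >= min_wells}

# Hypothetical cell-level calls and plate map.
calls = [("w1", True), ("w1", True), ("w1", True), ("w1", False),
         ("w2", True), ("w2", True), ("w2", False),
         ("w3", True), ("w3", False), ("w3", False), ("w3", False),
         ("w4", True), ("w4", True), ("w4", True)]
plate_map = {"w1": "GENE_A", "w2": "GENE_A", "w3": "GENE_B", "w4": "GENE_B"}

hit_wells = gmnn_like_wells(calls)
hit_genes = genes_with_multiple_wells(hit_wells, plate_map)
```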

Well-Level Modeling

•  Started with 27 parameters from MetaXpress

•  Performed automated feature selection
– Remove undefined, constant features
– Manually removed a few highly correlated features

•  Work with 12 parameters

•  Convert to Z-scores

•  Positive & negative controls are nicely separated
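The automated part of the feature selection and the Z-score conversion can be sketched as follows. The feature names and values are hypothetical, and the Z-scores are computed across all wells passed in; whether the original analysis normalized per plate or against the controls is not stated, so that choice is an assumption.

```python
import math
from statistics import mean, stdev

def usable(values):
    """Keep a feature only if every value is defined and it is not constant."""
    defined = all(v is not None and not math.isnan(v) for v in values)
    return defined and len(set(values)) > 1

def to_z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical well-level table: feature name -> one value per well.
features = {
    "feature_a": [12.0, 15.0, 11.0, 40.0],
    "feature_b": [100.0, 100.0, 100.0, 100.0],       # constant, dropped
    "feature_c": [210.0, float("nan"), 190.0, 205.0],  # undefined value, dropped
}

z_table = {name: to_z_scores(vals) for name, vals in features.items() if usable(vals)}
```

Removing highly correlated features was done manually in the original workflow, so it is not automated in this sketch.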

[Figure panels: All Wells; Control Wells]

Parameter Distributions

Model Performance

•  Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls

•  Perfect classification!
– Worrying: overfitting?
– Nearly: 99% of the control wells were confidently classified as positive or negative

Confusion matrix (well-level model, control wells):

            Positive    Negative
Positive        1504           0
Negative           0        1504

Descriptor Importance

•  What does the model identify as the most relevant descriptors?

•  Some parameters are moderately correlated

[Figure: random forest descriptor importance, MeanDecreaseGini (0 to 300). Descriptors in the order extracted from the plot: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, SPhaseCells, G0.G1Cells]

Random Forest Predictions

•  We use the model to predict the class for all the remaining wells

•  All four siRNAs targeting GMNN are classified as Geminin-like with high confidence

[Figure: x-axis, probability of being Geminin-like (0.0 to 1.0)]

Random Forest Predictions

•  Select genes for which more than 75% of the siRNAs are predicted to be Geminin-like with probability > 0.8

•  Good overlap with cell-level model
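The gene selection rule can be written down directly. In this sketch the gene names and probabilities are hypothetical; only the two cutoffs (more than 75% of siRNAs, probability above 0.8) come from the slide.

```python
def select_genes(sirna_probs, min_fraction=0.75, min_prob=0.8):
    """sirna_probs: gene -> predicted Geminin-like probabilities, one per siRNA."""
    selected = []
    for gene, probs in sirna_probs.items():
        confident = sum(1 for p in probs if p > min_prob)
        # Strict inequality: exactly 75% of siRNAs is not enough.
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)

# Hypothetical predictions for three genes with four siRNAs each.
predictions = {
    "GENE_A": [0.95, 0.90, 0.85, 0.92],
    "GENE_B": [0.95, 0.90, 0.85, 0.40],
    "GENE_C": [0.10, 0.20, 0.90, 0.30],
}
hits = select_genes(predictions)
```

With four siRNAs per gene, "more than 75%" read strictly means all four must pass, so a gene with three confident siRNAs out of four is excluded under this reading.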

AURKB, BRD8, C8orf79, CDCA8, CRAT, ESPL1, F12, GMNN, GUSB, INCENP, ITPKA, JU, KCNH6, KIF11, MLL4, OR10A2, PLK1, RPLP2, SNRK, TRIM64, TTK, UBC, WRN

GO Enrichment

•  GO Biological Processes enriched in this set of selected genes are relevant to the biology

•  Similarly with pathways (from GeneGo)

Clustering

•  RF classification is useful, but doesn't directly tell us much about finer groups of genes that might be phenotypically related

•  So we apply unsupervised clustering (PAM)
– Explore different numbers of clusters
– Evaluate statistical cluster quality metrics
– Evaluate biologically motivated quality metrics

•  We considered both plate-wise and experiment-wise clustering protocols
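Clustering around medoids can be illustrated with a small k-medoids routine. This is the simpler alternating variant (assign each point to its nearest medoid, then re-pick each medoid), not the build-and-swap procedure of PAM proper, and the starting medoids, Euclidean distance, and toy points are all assumptions.

```python
import math

def assign(points, medoids):
    """Label each point with the index (0..k-1) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]

def k_medoids(points, k, max_iter=100):
    medoids = list(range(k))  # assumption: start from the first k points
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            ))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(points, medoids)

# Hypothetical two-parameter well profiles forming two obvious groups.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoid_indices, cluster_labels = k_medoids(wells, k=2)
```

Because medoids are actual wells rather than averaged centroids, each cluster can be inspected through a representative well, which is convenient when checking images by eye.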

Platewise Clustering (k=4)

•  Cluster assignments can’t be directly compared across plates

•  Good to see that control columns are distinctly clustered

•  Certain plates show no membership to the ‘GMNN cluster’

Experimentwise Clustering (k=2)

•  Encouraging to see clean separation between control columns

•  Bulk of wells are identified as inactive

•  We can compare results from this clustering to RF classification
– 6 genes identified, with multiple siRNAs clustered with negative control

Experimentwise Clustering (k=2)

•  6 genes identified with multiple siRNAs clustered with the negative control

•  These were confidently identified by the RF model

AURKB, BRD8, C8orf79, CDCA8, CRAT, ESPL1, F12, GMNN, GUSB, INCENP, ITPKA, JU, KCNH6, KIF11, MLL4, OR10A2, PLK1, RPLP2, SNRK, TRIM64, TTK, UBC, WRN

How Many Clusters?

•  A priori, difficult to decide how many clusters there should be
– Manual spot checks did not identify distinctly different morphologies or counts

•  Evaluate clusters with varying k and calculate average silhouette width

•  Clustering based on the Euclidean metric doesn't do a good job
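The average silhouette width used to compare values of k can be sketched as below. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The Euclidean distance, the toy points, and the treatment of single-member clusters (scored 0) are assumptions, and the function needs at least two clusters.

```python
import math
from statistics import mean

def average_silhouette(points, labels):
    """Mean silhouette width over all points for a given cluster labeling."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention: singleton clusters score 0
            continue
        a = mean(math.dist(points[i], points[j]) for j in own)
        b = min(
            mean(math.dist(points[i], points[j]) for j in members)
            for other, members in clusters.items() if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)

# Hypothetical two-parameter well profiles and two candidate labelings.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = average_silhouette(wells, [0, 0, 0, 1, 1, 1])
poor = average_silhouette(wells, [0, 1, 0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate that points sit closer to another cluster than to their own.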

[Figure: x-axis, Number of Clusters (2 to 20)]

How Many Clusters?

•  One approach is to ignore clusterings that spread the GMNN siRNAs across multiple clusters

•  The current data suggest that we stick to k = 5
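One way to apply that filter is to measure, for each k, what fraction of the GMNN siRNA wells land in their most common cluster and to discard clusterings below some cutoff. The well identifiers, cluster labels, and the 0.75 cutoff below are hypothetical and serve only to show the mechanics.

```python
from collections import Counter

def concentration(cluster_of_well, control_wells):
    """Fraction of the control siRNA wells that fall in their most common cluster."""
    counts = Counter(cluster_of_well[w] for w in control_wells)
    return counts.most_common(1)[0][1] / len(control_wells)

# Hypothetical cluster labels for four GMNN siRNA wells at two values of k.
gmnn_wells = ["w1", "w2", "w3", "w4"]
labels_by_k = {
    5: {"w1": 3, "w2": 3, "w3": 3, "w4": 5},
    8: {"w1": 1, "w2": 4, "w3": 6, "w4": 7},
}

scores = {k: concentration(labels, gmnn_wells) for k, labels in labels_by_k.items()}
acceptable = [k for k, score in scores.items() if score >= 0.75]  # assumed cutoff
```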

Biological Enrichment in Clusters

•  Considering 5 clusters

•  Some clusters are annotated with more relevant terms

Cluster containing ¾ GMNN siRNAs

Signal Enhancement in Clusters

•  Signal is significantly enhanced in some clusters versus others

•  Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3

Making a Final Hitlist

•  Off-target effects are a major confounding factor

•  We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis

•  Select genes from individual clusters, using % G2 and number of siRNAs as secondary filters

•  Combine with hits from random forest model

Marine, S. et al, J. Biomol. Screen., 2011, ASAP

Reconfirmation

•  18/211 genes selected by thresholding in the primary screen reconfirmed using Ambion sequences

•  Considering just the genes selected by the random forest and/or clustering methods
– 11/30 genes selected by RF reconfirmed using Ambion libraries
– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries

•  ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly

•  Based on k = 5 clustering,
– 23/181 genes from cluster 3 reconfirmed
– 5/5 genes from cluster 5 reconfirmed
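Expressed as rates, the counts above make the comparison between selection methods easier to read. The sketch only restates the fractions given on the slide; the dictionary keys are descriptive labels chosen here.

```python
# (reconfirmed genes, selected genes) for each selection method, from the slide.
reconfirmed = {
    "threshold on primary screen": (18, 211),
    "random forest": (11, 30),
    "random forest and clustering": (5, 6),
    "cluster 3 (k = 5)": (23, 181),
    "cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmed.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Roughly 9% of threshold-selected genes reconfirmed, against about 37% for the random forest selection and 83% for genes picked by both the random forest and clustering, although the latter set is small.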

Outlook

•  Complements traditional threshold-based selection methods

•  The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front

•  Combined with clustering, it lets us zoom into biologically relevant clusters of genes

Acknowledgements

•  Scott Martin

•  Pinar Tuzmen

•  Carleen Klump

•  Eugen Buehler
