© 2001, Boehringer, Inc. - All Rights Reserved.
SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications
Presented at Spotfire Users Conference
Jun Xu
Boehringer Ingelheim Pharmaceuticals, Inc.
May 3, 2001
Introduction: Diversity & Drug Design
• Lead Screening
– Select compounds for UHTS
– Select compounds for acquisition
• Combinatorial Library Design
– Compare virtual libraries
– Compare virtual libraries against existing inventory
– Select sub-library to make
Importance of Data Visualization
• Graphically review structural diversity
• Graphically filter unwanted compounds
• Graphically select sub-set
• Graphically study the relations between structure and activity
Challenge!
• Chemical structures are graphs
• The number of compounds in a library can be very large
Solution to study the diversity of a large compound library: conventional methods
• Describing: descriptors such as fingerprints, topological indexes, properties, similarities, distances, etc.
• Dimension Reduction: transformed data matrix (redundancy is filtered, number of dimensions is reduced)
[Figure: structures from the compound library are described, dimension-reduced, and mapped onto a plot with axes X and Y]
Mapping
• Principal Component Analysis (PCA)
– Transform a matrix M(m,n) to M’(m,n’)
– The n’ dimensions are sorted based on the eigenvalues
– If the top three dimensions can explain >85% of the variance in the data, M’(m,3) is a fair approximation of M(m,n); otherwise PCA cannot be used for mapping
• Multi-Dimensional Scaling (MDS)
– Based on distance matrix
– Convert M(m,n) to M’(m,2) in an irrational method
• One of the Problems
– The new dimensions have no chemical/physical meaning
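The 85% criterion for PCA mapping can be sketched as a small check on the eigenvalues. This is a minimal illustration, not the presenter's code: the function names and the eigenvalues are hypothetical, and the eigenvalues are assumed to come from a covariance matrix of the descriptor data.

```python
def explained_fraction(eigenvalues, top=3):
    """Fraction of total variance carried by the `top` largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)  # PCA sorts dimensions by eigenvalue
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    return sum(ordered[:top]) / total


def pca_map_is_fair(eigenvalues, top=3, threshold=0.85):
    """True when the top dimensions explain more than `threshold` of the variance."""
    return explained_fraction(eigenvalues, top) > threshold


# Hypothetical eigenvalues, for illustration only
eigs = [5.0, 2.5, 1.5, 0.6, 0.4]
print(explained_fraction(eigs), pca_map_is_fair(eigs))
```

With a flat spectrum such as five equal eigenvalues, the top three explain only 60%, so the check fails and a three-axis PCA map would be a poor approximation.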
An example of mapping
Clustering
• To divide n objects into m bins (n ≥ m)
• Clustering is pattern recognition
• Clustering can be unsupervised learning
General steps for clustering
• Select the data describing the objects
• Extract patterns from the data
– normalizing rows
– normalizing columns
– normalizing methods
• Measure Similarity
• Select a proper and robust clustering method
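Two of the steps above, normalizing columns and measuring similarity, can be sketched as follows. Min-max scaling and Euclidean distance are assumptions chosen for illustration; the slides do not name a specific normalization or similarity measure, and the descriptor rows are made up.

```python
import math


def normalize_columns(rows):
    """Min-max scale every column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two descriptor rows (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical descriptor rows: (molecular weight, ring count)
rows = [[100.0, 1.0], [300.0, 3.0], [200.0, 2.0]]
scaled = normalize_columns(rows)
print(scaled)
print(euclidean(scaled[0], scaled[2]))
```

Without the column normalization, the molecular-weight column would dominate the distance simply because its numbers are larger.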
Problems in conventional methods
• Selecting and computing “correct” descriptors is difficult and time-consuming
• Hierarchical algorithms force “dogs” and “cats” to be together
• Non-hierarchical algorithms ask for “number of clusters” and other settings
• The SOM method asks you to set at least eight irrational parameters
“How many do you want?”
“How many clusters are in my library?”
...
K-means cluster:
K-means and K-nearest Neighbor Approaches
• Assuming the number of clusters is known
• Computing complexity:
N_j = \frac{n!}{n_1! \, n_2! \cdots n_k!}
N_j represents the number of jth combinations in k clusters (groups)
n represents the number of objects
n_i represents the number of objects in the ith cluster
k represents the number of clusters
It is an NP-complete problem
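To get a feel for how fast this count grows, the multinomial coefficient can be evaluated directly. A minimal sketch; the function name and the cluster sizes are illustrative only.

```python
from math import factorial


def partition_count(sizes):
    """Number of ways to assign n = sum(sizes) labeled objects to clusters
    of the given sizes: n! / (n_1! * n_2! * ... * n_k!)."""
    count = factorial(sum(sizes))
    for size in sizes:
        count //= factorial(size)  # each intermediate quotient is an exact integer
    return count


print(partition_count([5, 3, 2]))      # 10 objects in 3 clusters -> 2520
print(partition_count([50, 30, 20]))   # 100 objects: already an astronomically large number
```

Even for a fixed set of cluster sizes, exhaustive enumeration is out of reach for libraries with hundreds of thousands of compounds.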
Self-Organizing Map (SOM) Approach
• To run SOM, 8 parameters have to be set up properly, as follows:
– Data initialization: random or ordered
– Neighborhood function: bubble or Gaussian
– Neuron topology: hexagonal or rectangular
– Neural dimensions: X and Y (how many cells/neurons)
– Number of training steps: such as 10,000
– Initial learning rate: such as 0.03
– Initial radius of training area: such as 10
– Monitoring parameter: number of steps for generating 2D points on a plane, such as 100
S-Cluster: New approach
• No need to compute descriptors
• No need to give the number of clusters
• Faster
• Rational parameters
• Results are explained chemically
S-Cluster Algorithm (1)
• Extract scaffolds
• Reference scaffold (S_v):
– number of rings in the smallest set of smallest rings (sssrs)
– number of non-H atoms (atoms)
– number of bonds, excluding bonds to H (bonds)
– sum of non-H atomic numbers (zs)
– V_v = { sssrs, atoms, bonds, zs } for S_v
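The four-component vector can be computed from a plain heavy-atom graph. The sketch below is an assumption-laden illustration, not the S-Cluster implementation: the input format (a list of atomic numbers plus index-pair bonds) is hypothetical, and the ring count uses the standard graph identity that the size of the SSSR equals bonds - atoms + connected components.

```python
def scaffold_vector(atomic_numbers, bonds):
    """Return (sssrs, atoms, bonds, zs) for a hydrogen-suppressed graph.

    atomic_numbers: atomic number Z of each heavy atom (hypothetical input format)
    bonds: list of (i, j) atom-index pairs
    """
    n_atoms = len(atomic_numbers)
    n_bonds = len(bonds)

    # Union-find to count connected components
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in bonds:
        parent[find(i)] = find(j)
    components = len({find(x) for x in range(n_atoms)})

    sssrs = n_bonds - n_atoms + components  # cyclomatic number = SSSR size
    return (sssrs, n_atoms, n_bonds, sum(atomic_numbers))


# Benzene: six carbons in one ring
benzene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(scaffold_vector([6] * 6, benzene_bonds))  # (1, 6, 6, 36)
```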
Deriving Scaffolds
[Figure: Structure → prune side chains → Scaffold. Bond types labeled in the drawing: ring bonds, linker bonds, chain bonds]
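Pruning side chains can be sketched as repeatedly deleting atoms with at most one neighbor, which leaves the rings and the linkers between them. This is a minimal sketch under assumptions: the graph input format is hypothetical, and the slides do not say how details such as exocyclic double bonds are treated, so here every terminal atom is removed.

```python
def prune_side_chains(n_atoms, bonds):
    """Return (kept_atoms, kept_bonds) after repeatedly removing terminal atoms.

    bonds: list of (i, j) atom-index pairs of a hydrogen-suppressed graph.
    An acyclic molecule is pruned away completely (empty scaffold).
    """
    neighbors = {atom: set() for atom in range(n_atoms)}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    # Atoms with zero or one neighbor are side-chain ends
    leaves = [atom for atom, nbrs in neighbors.items() if len(nbrs) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in neighbors:
            continue  # already removed
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            if len(neighbors[nbr]) <= 1:
                leaves.append(nbr)  # removal exposed a new chain end

    kept_bonds = [(i, j) for i, j in bonds if i in neighbors and j in neighbors]
    return sorted(neighbors), kept_bonds


# Toluene-like graph: ring atoms 0-5, methyl atom 6 attached to atom 0
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
atoms, kept = prune_side_chains(7, ring + [(0, 6)])
print(atoms)  # [0, 1, 2, 3, 4, 5]
```

Linker atoms between two rings survive because they always keep two neighbors, while chain atoms are stripped one layer at a time from the free end.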
S-Cluster Algorithm (2)
• The complexity of a structure:
V_i = { sssrs, atoms, bonds, zs } for structure S_i
V_v = { sssrs, atoms, bonds, zs } from a reference scaffold
P_i = \| V_v + V_i \|
M_i = \| V_v - V_i \|
Similarity(S_i) = \frac{P_i - M_i}{P_i}
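The similarity to the reference scaffold can be evaluated in a few lines. A minimal sketch, assuming the double bars denote the Euclidean norm (the slides do not specify the norm); the function names and the example vectors are illustrative.

```python
import math


def norm(vector):
    """Euclidean norm (an assumption; the slides do not name the norm)."""
    return math.sqrt(sum(x * x for x in vector))


def complexity_similarity(v_ref, v_i):
    """(P - M) / P with P = ||v_ref + v_i|| and M = ||v_ref - v_i||."""
    p = norm([a + b for a, b in zip(v_ref, v_i)])
    m = norm([a - b for a, b in zip(v_ref, v_i)])
    if p == 0:
        raise ValueError("both vectors are zero; similarity is undefined")
    return (p - m) / p


# Vectors are (sssrs, atoms, bonds, zs)
benzene = (1, 6, 6, 36)        # used here as a hypothetical reference scaffold
naphthalene = (2, 10, 11, 60)
print(complexity_similarity(benzene, benzene))      # 1.0: identical to the reference
print(complexity_similarity(benzene, naphthalene))  # between 0 and 1
```

For vectors with non-negative components the value falls between 0 and 1, and it reaches 1 only when the structure's vector equals the reference vector.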
S-Cluster Algorithm (3)
• The “Cyclicity” of a structure
– The sum of heavy atomic numbers (a)
– The number of rotatable bonds (r)
– The number of 1-degree nodes (d1)
– The number of double bonds (db)
– The number of triple bonds (tb)
– The number of 2-degree nodes (d2)
– V_s = { a, r, d1, db, tb, d2 } for the scaffold
– V_i = { a, r, d1, db, tb, d2 } for structure(i)
Cyclicity(S_i) = \frac{\| V_s + V_i \| - \| V_s - V_i \|}{\| V_s + V_i \|}
Results and discussions
• Cluster the following libraries together:
– ACD (250,468 structures)
– NCI (126,554, MDL 1994)
– CMC (4,591 oral drugs)
– MDDR (6,347 launched or pre-clinical drugs or compounds)
• Cluster all 387,960 structures on an NT laptop (Compaq Armada E700)
• Running time: 1 h 42 min
Cyclicity vs Complexity
The most complex structure is at the upper right
[Figure: structure drawing]
The most chain-like structure is at the bottom left
[Figure: structure drawing]
Zoom-in: Substituent Patterns
[Figure: zoomed-in plot with regions labeled 1 to 7]
Table 1. Some Chemical Diversity Patterns (examples are from the ACD database; example structure drawings omitted)
Pattern #   Meaning
1           Single Li-substituted compounds
2           Single methyl-substituted compounds, sometimes ethylene-substituted compounds
3           Single primary amine compounds
4           Single carbonyl group or alcohol group compounds
5           Single fluoride-substituted compounds
6           Double methyl-substituted or single ethyl-substituted compounds
7           Single primary amine and single methyl-substituted compounds
Diversity “Island” and “Density”
[Figure: plot with two islands labeled A and B]
A: Single O substituents
B: Single F substituents
“Cyclicity” vs Average Electronegativity
“Cyclicity” vs H-Bond Donors
Reagent Selector(R) clustering result (Jarvis-Patrick method): input 116 compounds, asked for 26 clusters. This is cluster 2.
[Figure: structure drawings of cluster 2]
Result from the S-Cluster algorithm: input 116 compounds, 26 clusters were found. This is cluster 2.
[Figure: structure drawings of cluster 2]
Applications
• Evaluate libraries
• Compare libraries
• Design a focused library
Blue: Virtual Library; Red: Target Library
The optimized sub-library to be made from the virtual library
But if you still want to cluster molecules (genes or small molecules) based upon their property/activity arrays...
We have V-Cluster (Vector Cluster Algorithm) for this requirement; it will be presented later.
Conclusions
• We emphasize finding natural clusters
• There must be chemical/physical explanations for computational results
• Before a software “button” is pushed, the mathematical/chemical/physical/biological meaning should be understood
• A good algorithm should be robust
Acknowledgements
• Cheminformatics/Medicinal Chemistry
– Dr. Qiang Zhang
– Dr. Hans Briem
– Dr. Ron Magolda