© 2001, Boehringer, Inc. - All Rights Reserved. SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications Presented at Spotfire Users.

© 2001, Boehringer, Inc. - All Rights Reserved.

SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications

Presented at Spotfire Users Conference

Jun Xu

Boehringer Ingelheim Pharmaceuticals, Inc.

May 3, 2001

Introduction: Diversity & Drug Design

• Lead Screening

– Select compounds for UHTS

– Select compounds for acquisition

• Combinatorial Library Design

– Compare virtual libraries

– Compare virtual libraries against existing inventory

– Select sub-library to make

Importance of Data Visualization

• Graphically review structural diversity

• Graphically filter unwanted compounds

• Graphically select sub-set

• Graphically study the relations between structure and activity

Challenge!

• Chemical structures are graphs

• The number of compounds in a library can be very large

Solution to study the diversity of a large compound library

Conventional methods

• Describing: descriptors such as fingerprints, topological indexes, properties, similarities, distances, etc.

• Dimension Reduction: transformed data matrix (redundancy is filtered, number of dimensions is reduced)

[Figure: structures from the compound library, described by descriptors, reduced in dimension, and plotted on X and Y axes]

Mapping

• Principal Component Analysis (PCA)

– Transform a matrix M(m,n) to M’(m,n’)

– The n’ dimensions are sorted by their eigenvalues

– If the top three dimensions explain >85% of the variance in the data, M’(m,3) is a fair approximation of M(m,n); otherwise PCA cannot be used for mapping

• Multi-Dimensional Scaling (MDS)

– Based on a distance matrix

– Converts M(m,n) to M’(m,2) by an irrational method

• One of the problems

– The new dimensions have no chemical/physical meaning
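The 85% rule for PCA can be checked directly from the eigenvalues. The sketch below is only an illustration: the function name `explained_by_top` and the eigenvalue lists are made up for the example, and the 0.85 threshold is taken from the slide.

```python
def explained_by_top(eigenvalues, k=3):
    """Fraction of the total variance carried by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)  # dimensions sorted by eigenvalue
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical eigenvalues of a descriptor covariance matrix (illustrative only).
good = [5.2, 2.1, 1.4, 0.6, 0.4, 0.3]
poor = [3.0, 2.0, 2.0, 2.0, 1.0]

good_fraction = explained_by_top(good)   # top three carry 8.7 of 10.0
poor_fraction = explained_by_top(poor)   # top three carry 7.0 of 10.0

# Rule from the slide: map with PCA only if the top three dimensions explain >85%.
good_usable = good_fraction > 0.85
poor_usable = poor_fraction > 0.85
```

With these numbers the first library could be mapped in three dimensions and the second could not, which is the case where PCA mapping is not appropriate.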

An example of mapping

Clustering

• To divide n objects into m bins (n ≥ m)

• Clustering is pattern recognition

• Clustering can be unsupervised learning

General steps for clustering

• Select the data describing the objects

• Extract patterns from the data

– normalizing rows

– normalizing columns

– normalizing methods

• Measure Similarity

• Select a proper and robust clustering method
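A minimal sketch of the "normalize columns" and "measure similarity" steps, under stated assumptions: min-max scaling stands in for the normalization, Euclidean distance stands in for the (dis)similarity measure, and the three-row descriptor table is invented for illustration. None of these choices comes from the slides.

```python
import math

def normalize_columns(rows):
    """Scale every column to [0, 1] (min-max normalization, an assumed choice)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # A constant column has zero span; use 1.0 there to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical descriptor table: one row per compound, two descriptors each.
descriptors = [[100.0, 1.0], [200.0, 3.0], [300.0, 2.0]]
scaled = normalize_columns(descriptors)

# Pairwise distance matrix, the usual input to a clustering method.
distances = [[euclidean(a, b) for b in scaled] for a in scaled]
```

Without the scaling step the first column would dominate every distance simply because its values are larger.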

Problems in conventional methods

• Selecting and computing “correct” descriptors is difficult and time-consuming

• Hierarchical algorithms force “dogs” and “cats” to be together

• Non-hierarchical algorithms ask for the “number of clusters” and other settings

• The SOM method asks you to set at least eight irrational parameters

“How many do you want?”

“How many clusters are in my library?”

...

K-means clustering:

K-means and K-Nearest Neighbor Approaches

• Assuming the number of clusters is known

• Computing complexity:

$$N_j = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$

$N_j$ represents the number of combinations for the j-th division into k clusters (groups)

$n$ represents the number of objects

$n_i$ represents the number of objects in the i-th cluster

$k$ represents the number of clusters

It is an NP-complete problem
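To get a feel for how fast this count grows, the multinomial coefficient n! / (n1! n2! ... nk!) can be evaluated for a few cluster-size choices. This is a small sketch; the function name `assignments_for_sizes` and the example sizes are illustrative.

```python
from math import factorial

def assignments_for_sizes(sizes):
    """Number of ways to place n = sum(sizes) objects into clusters of the given sizes."""
    count = factorial(sum(sizes))
    for size in sizes:
        count //= factorial(size)  # exact at every step: the partial quotient is an integer
    return count

small = assignments_for_sizes([2, 2])        # 4 objects, 2 clusters
medium = assignments_for_sizes([3, 3, 4])    # 10 objects, 3 clusters
large = assignments_for_sizes([10] * 10)     # 100 objects, 10 clusters
```

Even at 100 objects the count has more than 90 digits, and this is for a single choice of cluster sizes, which is why exhaustive search is out of the question for a library of hundreds of thousands of compounds.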

Self-Organizing Map (SOM) Approach

• To run SOM, 8 parameters have to be set up properly as follows:

– Data Initialization: random or ordered

– Neighborhood function: Bubble or Gaussian

– Neuron topology: hexagonal or rectangular

– Neural dimensions: X and Y (how many cells/neurons)

– Number of training steps: such as, 10,000

– Initial learning rate: such as, 0.03

– Initial radius of training area: such as, 10

– Monitoring parameter: number of steps for generating 2D points on a plane, such as, 100
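Written out as a configuration object, the eight settings look like the sketch below. The class and field names are hypothetical and do not belong to any particular SOM package; the numeric defaults are the "such as" values from the list above, and the 12 x 8 grid is an invented placeholder.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class SomSettings:
    """Hypothetical container for the eight SOM parameters listed on the slide."""
    initialization: str = "random"        # "random" or "ordered"
    neighborhood: str = "gaussian"        # "bubble" or "gaussian"
    topology: str = "hexagonal"           # "hexagonal" or "rectangular"
    dimensions: tuple = (12, 8)           # X and Y neuron counts (placeholder values)
    training_steps: int = 10_000
    initial_learning_rate: float = 0.03
    initial_radius: float = 10.0
    monitoring_interval: int = 100        # steps between generated 2D point sets

    def __post_init__(self):
        # Reject values outside the options named on the slide.
        if self.initialization not in ("random", "ordered"):
            raise ValueError("initialization must be 'random' or 'ordered'")
        if self.neighborhood not in ("bubble", "gaussian"):
            raise ValueError("neighborhood must be 'bubble' or 'gaussian'")
        if self.topology not in ("hexagonal", "rectangular"):
            raise ValueError("topology must be 'hexagonal' or 'rectangular'")

settings = SomSettings()
parameter_count = len(fields(SomSettings))
```

None of these values follows from the chemistry of the library, which is the objection raised against the method here.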

S-Cluster: New approach

• No need to compute descriptors

• No need to give the number of clusters

• Faster

• Rational parameters

• Results are explained chemically

S-Cluster Algorithm (1)

• Extract scaffolds

• Reference scaffold (Sv):

– number of smallest set of smallest rings (sssrs)

– number of non-H atoms (atoms)

– number of bonds (excluding H bonds) (bonds)

– sum of non-H atomic numbers (zs)

– Vv = { sssrs, atoms, bonds, zs } for Sv
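The four-element vector can be computed from a plain graph description of a scaffold. The sketch below makes several assumptions: a scaffold is given as a list of heavy-atom atomic numbers plus a list of bonds (pairs of atom indices), and the SSSR count is taken as the cyclomatic number (bonds - atoms + connected components), which equals the size of the smallest set of smallest rings. The function name `scaffold_vector` is illustrative.

```python
def scaffold_vector(atomic_numbers, bonds):
    """Return (sssrs, atoms, bonds, zs) for a hydrogen-suppressed scaffold graph."""
    n = len(atomic_numbers)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        parent[find(a)] = find(b)
    components = len({find(i) for i in range(n)})

    sssrs = len(bonds) - n + components   # cyclomatic number = ring count in the SSSR
    return (sssrs, n, len(bonds), sum(atomic_numbers))

# Benzene ring: six carbons in one cycle.
benzene = scaffold_vector([6] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])

# Naphthalene: two fused six-membered rings sharing the 4-5 bond.
naphthalene = scaffold_vector(
    [6] * 10,
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
     (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)],
)
```

No descriptor software is needed for this step; everything is read off the connection table.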

Deriving Scaffolds

[Figure: a structure and the scaffold obtained from it by pruning side chains. Bond labels in the figure: ring bonds, linker bonds, chain bonds.]
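One way to prune side chains is to strip terminal atoms repeatedly until none are left; rings and the linkers between rings survive, chains do not. The sketch below is an assumption about how the step could be done, not the implementation used in S-Cluster, and it ignores bond orders. The function name and the example graph are illustrative.

```python
def prune_side_chains(num_atoms, bonds):
    """Iteratively remove atoms with at most one neighbor; return kept atoms and bonds."""
    neighbors = {i: set() for i in range(num_atoms)}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    pending = [i for i, nb in neighbors.items() if len(nb) <= 1]
    while pending:
        atom = pending.pop()
        if atom not in neighbors:
            continue                      # already removed earlier
        for nb in neighbors.pop(atom):
            neighbors[nb].discard(atom)
            if len(neighbors[nb]) <= 1:   # neighbor has become terminal
                pending.append(nb)

    kept_atoms = sorted(neighbors)
    kept_bonds = [(a, b) for a, b in bonds if a in neighbors and b in neighbors]
    return kept_atoms, kept_bonds

# Two three-membered rings (0-1-2 and 4-5-6) joined through linker atom 3,
# with a two-atom side chain (7-8) hanging off the linker.
example_bonds = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4),
                 (4, 5), (5, 6), (6, 4), (3, 7), (7, 8)]
scaffold_atoms, scaffold_bonds = prune_side_chains(9, example_bonds)
```

In the example the side chain (atoms 7 and 8) is removed, while both rings and the linker atom remain. A structure with no rings prunes down to nothing under this rule.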


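• The complexity of a structure:

$V_i = \{\mathrm{sssrs}, \mathrm{atoms}, \mathrm{bonds}, \mathrm{zs}\}$ for $S_i$

$V_v = \{\mathrm{sssrs}, \mathrm{atoms}, \mathrm{bonds}, \mathrm{zs}\}$ from a reference scaffold

$P_i = \lVert V_v + V_i \rVert$

$M_i = \lVert V_v - V_i \rVert$

$$\mathrm{Similarity}(S_i) = \frac{P_i - M_i}{P_i}$$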
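A worked example of the complexity score, reading the formula on this slide as Similarity(Si) = (Pi - Mi) / Pi. The operator between Pi and Mi is lost in the extracted text, so the minus sign is an assumption, as is the use of the Euclidean norm for || ||. The two vectors are { sssrs, atoms, bonds, zs } for benzene and naphthalene.

```python
import math

def norm(vector):
    """Euclidean norm (assumed meaning of the double bars on the slide)."""
    return math.sqrt(sum(x * x for x in vector))

def similarity_to_reference(v_ref, v_i):
    """(P - M) / P with P = ||v_ref + v_i|| and M = ||v_ref - v_i||."""
    p = norm([a + b for a, b in zip(v_ref, v_i)])
    m = norm([a - b for a, b in zip(v_ref, v_i)])
    return (p - m) / p   # undefined if both vectors are all zeros

benzene = (1, 6, 6, 36)        # sssrs, atoms, bonds, zs
naphthalene = (2, 10, 11, 60)

same = similarity_to_reference(benzene, benzene)
different = similarity_to_reference(benzene, naphthalene)
```

For vectors with non-negative entries the score lies between 0 and 1 and equals 1 only when the structure's vector matches the reference, so a single number per structure is enough to place it on a complexity axis.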


• The “Cyclicity” of a structure

– The sum of heavy atomic numbers (a)

– The number of rotatable bonds (r)

– The number of 1-degree nodes (d1)

– The number of double bonds (db)

– The number of triple bonds (tb)

– The number of 2-degree nodes (d2)

– Vs = { a, r, d1, db, tb, d2 } for the scaffold

– Vi = { a, r, d1, db, tb, d2 } for structure(i)

$$\mathrm{Cyclicity}(S_i) = \frac{\lVert V_s + V_i \rVert - \lVert V_s - V_i \rVert}{\lVert V_s + V_i \rVert}$$

Results and discussions

• Cluster the following libraries together:

– ACD (250,468 structures)

– NCI (126,554, MDL 1994)

– CMC (4591 oral drugs)

– MDDR (6347 launch or pre-clinical drugs or compounds)

• Cluster all 387,960 structures on an NT laptop (Compaq, Armada E700)

• Running time: 1 h 42 min
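As a quick check of these figures, the four library sizes can be added up and divided by the running time. Only numbers from the bullets above are used; the variable names are illustrative.

```python
library_sizes = {
    "ACD": 250_468,
    "NCI": 126_554,
    "CMC": 4_591,
    "MDDR": 6_347,
}

total_structures = sum(library_sizes.values())

running_time_seconds = 1 * 3600 + 42 * 60          # 1 h 42 min
structures_per_second = total_structures / running_time_seconds
```

The sizes do sum to the 387,960 structures quoted, and the rate works out to a little over 63 structures per second on the laptop.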

Cyclicity vs Complexity

[Figure: plot of cyclicity vs complexity with example structures. The most complicated structure is at the upper right; the most chain-like structure is at the bottom left.]

Zoom-in: Substituent Patterns

[Figure: zoomed-in region of the plot with seven substituent patterns labeled 1 to 7]

Table 1. Some Chemical Diversity Patterns (examples are from the ACD database)

Pattern #   Meaning                                                                          Example
1           Single Li substituted compounds                                                  [structure drawing]
2           Single methyl substituted compounds, sometimes ethylene-substituted compounds   [structure drawing]
3           Single primary amine compounds                                                   [structure drawing]
4           Single carbonyl group or alcohol group compounds                                 [structure drawing]
5           Single fluoride substituted compounds                                            [structure drawing]
6           Double methyl substituted or single ethyl substituted compounds                  [structure drawing]
7           Single primary amine and single methyl substituted compounds                     [structure drawing]

Diversity “Island” and “Density”

[Figure: plot with two diversity “islands”, A and B]

A: Single O substituents

B: Single F substituents

“Cyclicity” vs Average Electronegativity

“Cyclicity” vs H-Bond Donors

[Figure: structures in cluster 2 from Reagent Selector(R)]

Reagent Selector(R) clustering result (Jarvis-Patrick method): input 116 compounds, asked for 26 clusters. This is cluster 2.

[Figure: structures in cluster 2 from the S-Cluster algorithm]

Result from the S-Cluster algorithm: input 116 compounds, 26 clusters were found. This is cluster 2.

Applications

• Evaluate libraries

• Compare libraries

• Design a focused library

Blue: Virtual Library Red: Target Library

The optimized sub-library to be made from the virtual library

But if you still want to cluster molecules (genes or small molecules) based upon their property/activity arrays...

We have V-Cluster (Vector Cluster Algorithm) for this requirement; it will be presented later

Conclusions

• We emphasize finding natural clusters

• There must be chemical/physical explanations for computational results

• Before a software “button” is pushed, the mathematical/chemical/physical/biological meaning should be understood

• A good algorithm should be robust

Acknowledgements

• Cheminformatics/Medicinal Chemistry

– Dr. Qiang Zhang

– Dr. Hans Briem

– Dr. Ron Magolda
