© 2001, Boehringer, Inc. - All Rights Reserved. SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications Presented at Spotfire Users Conference Jun Xu Boehringer Ingelheim Pharmaceuticals, Inc. May 3, 2001
Page 1:

© 2001, Boehringer, Inc. - All Rights Reserved.

SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications

Presented at Spotfire Users Conference

Jun Xu

Boehringer Ingelheim Pharmaceuticals, Inc.

May 3, 2001

Page 2:

Introduction: Diversity & Drug Design

• Lead Screening
– Select compounds for UHTS
– Select compounds for acquisition

• Combinatorial Library Design
– Compare virtual libraries
– Compare virtual libraries against existing inventory
– Select sub-library to make

Page 3:

Importance of Data Visualization

• Graphically review structural diversity

• Graphically filter unwanted compounds

• Graphically select a subset

• Graphically study the relations between structure and activity

Page 4:

Challenge!

• Chemical structures are graphs

• The number of compounds in a library can be very large

Page 5:

Solution to study the diversity of a large compound library: conventional methods

[Figure: workflow diagram. Compound Library → Describing → Descriptors (such as fingerprints, topological indexes, properties, similarities, distances, etc.) → Dimension Reduction → Transformed data matrix (redundancy is filtered, number of dimensions is reduced) → X-Y plot]

Page 6:

Mapping

• Principal Component Analysis (PCA)
– Transforms a matrix M(m,n) to M'(m,n')
– The n' dimensions are sorted by their eigenvalues
– If the top three dimensions explain >85% of the variance in the data, M'(m,3) is a fair approximation of M(m,n); otherwise PCA cannot be used for mapping

• Multi-Dimensional Scaling (MDS)
– Based on a distance matrix
– Converts M(m,n) to M'(m,2) by an irrational method

• One of the problems
– The new dimensions have no chemical/physical meaning
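The 85% rule for PCA mapping can be checked directly from the eigenvalues. The sketch below assumes the eigenvalues of the covariance matrix have already been computed elsewhere; the function names and the example values are illustrative and do not come from the slides.

```python
def explained_fraction(eigenvalues, top_k=3):
    """Fraction of total variance carried by the top_k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return sum(ordered[:top_k]) / total


def pca_usable_for_mapping(eigenvalues, top_k=3, threshold=0.85):
    """Apply the slide's rule: the top three dimensions must explain >85%."""
    return explained_fraction(eigenvalues, top_k) > threshold


# Hypothetical eigenvalues for a five-descriptor data set.
example = [5.0, 3.0, 1.0, 0.5, 0.5]
fraction = explained_fraction(example)
usable = pca_usable_for_mapping(example)
```

With these made-up eigenvalues the top three dimensions carry 9 of 10 units of variance, so a 3D map would pass the rule; a flat spectrum such as five equal eigenvalues would not.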

Page 7:

An example of mapping

Page 8:

Clustering

• To divide n objects into m bins (n ≥ m)

• Clustering is a form of pattern recognition

• Clustering can be a form of unsupervised learning

Page 9:

General steps for clustering

• Select the data that describe the objects

• Extract patterns from the data
– normalizing rows
– normalizing columns
– normalizing methods

• Measure similarity

• Select a proper and robust clustering method
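The normalization and similarity steps listed above can be sketched as follows. This is a minimal illustration, assuming z-score column normalization and Euclidean distance; the slides do not say which normalization or similarity measure is meant, and the descriptor table is made up.

```python
import math


def normalize_columns(rows):
    """Scale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)) or 1.0
        for col, mean in zip(columns, means)
    ]
    return [
        [(x - mean) / std for x, mean, std in zip(row, means, stds)]
        for row in rows
    ]


def euclidean_distance(u, v):
    """Distance between two descriptor rows; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical descriptor table: one row per object, two columns on very different scales.
table = [[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]]
normalized = normalize_columns(table)
distance_01 = euclidean_distance(normalized[0], normalized[1])
```

Without the column normalization the first descriptor would dominate the distance simply because its numbers are larger, which is why the normalization step comes before measuring similarity.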

Page 10:

Problems in conventional methods

• Selecting and computing “correct” descriptors is difficult and time-consuming

• Hierarchical algorithms force “dogs” and “cats” to be together

• Non-hierarchical algorithms ask for “number of clusters” and other settings

• The SOM method asks you to set at least eight irrational parameters

Page 11:

K-means clustering:

"How many clusters are in my library?"

"How many do you want?"

...

Page 12:

K-means and K-nearest Neighbor Approaches

• Assuming the number of clusters is known

• Computing complexity:

$$N_j = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$

N_j represents the number of jth combinations in k clusters (groups)

n represents the number of objects

n_i represents the number of objects in the ith cluster

k represents the number of clusters

It is an NP-complete problem
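To get a feel for how fast this count grows, the multinomial coefficient n!/(n_1! n_2! ... n_k!) can be evaluated for small cases. The function name and the example cluster sizes below are illustrative only.

```python
from math import factorial


def partition_count(cluster_sizes):
    """Number of ways to assign n objects to clusters of the given sizes."""
    n = sum(cluster_sizes)
    count = factorial(n)
    for size in cluster_sizes:
        # Each intermediate quotient is an integer, so exact division is safe.
        count //= factorial(size)
    return count


# Ten objects split into clusters of 3, 3 and 4 objects.
small_case = partition_count([3, 3, 4])
```

Even ten objects split 3/3/4 give 4,200 assignments for that one choice of cluster sizes, so exhaustive search over a real compound library is out of reach.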

Page 13:

Self-Organizing Map (SOM) Approach

• To run SOM, 8 parameters have to be set up properly:
– Data initialization: random or ordered
– Neighborhood function: bubble or Gaussian
– Neuron topology: hexagonal or rectangular
– Neural dimensions: X and Y (how many cells/neurons)
– Number of training steps, e.g., 10,000
– Initial learning rate, e.g., 0.03
– Initial radius of training area, e.g., 10
– Monitoring parameter: number of steps for generating 2D points on a plane, e.g., 100
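The eight settings can be collected in one configuration object, which makes it plain how much has to be decided before a SOM run starts. This is a hypothetical container, not the interface of any particular SOM package; the field names and the map size are assumptions, while the training steps, learning rate, radius and monitoring interval reuse the example values from the slide.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass(frozen=True)
class SomSettings:
    """Hypothetical bundle of the eight SOM parameters listed on the slide."""
    initialization: str = "random"        # "random" or "ordered"
    neighborhood: str = "gaussian"        # "bubble" or "gaussian"
    topology: str = "hexagonal"           # "hexagonal" or "rectangular"
    map_size: Tuple[int, int] = (12, 8)   # X and Y neuron counts (made-up values)
    training_steps: int = 10_000
    initial_learning_rate: float = 0.03
    initial_radius: int = 10
    monitor_interval: int = 100           # steps between generated 2D snapshots

    def __post_init__(self):
        # Reject values outside the choices named on the slide.
        if self.initialization not in ("random", "ordered"):
            raise ValueError("initialization must be 'random' or 'ordered'")
        if self.neighborhood not in ("bubble", "gaussian"):
            raise ValueError("neighborhood must be 'bubble' or 'gaussian'")
        if self.topology not in ("hexagonal", "rectangular"):
            raise ValueError("topology must be 'hexagonal' or 'rectangular'")


defaults = SomSettings()
parameter_count = len(fields(SomSettings))
```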

Page 14:

S-Cluster: New approach

• No need to compute descriptors

• No need to give the number of clusters

• Faster

• Rational parameters

• Results are explained chemically

Page 15:

S-Cluster Algorithm (1)

• Extract scaffolds

• Reference scaffold (S_v):
– number of rings in the smallest set of smallest rings (sssrs)
– number of non-H atoms (atoms)
– number of bonds, excluding bonds to H (bonds)
– sum of non-H atomic numbers (zs)
– V_v = { sssrs, atoms, bonds, zs } for S_v

Page 16:

Deriving Scaffolds

[Figure: a structure is reduced to its scaffold by pruning side chains. Labeled bond types: linker bonds, ring bonds, chain bonds.]
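Pruning side chains can be sketched on a hydrogen-suppressed molecular graph by repeatedly removing terminal atoms until only rings and the linkers between them remain. This is an assumed reading of "prune side chains", not the author's implementation; it ignores details such as exocyclic double bonds, and the atom numbering in the example is made up.

```python
def prune_side_chains(bonds):
    """Return the atoms that survive repeated removal of terminal atoms.

    bonds: iterable of (atom_a, atom_b) pairs of a hydrogen-suppressed graph.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    terminal = [atom for atom, nbrs in neighbors.items() if len(nbrs) == 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbors or len(neighbors[atom]) != 1:
            continue  # already removed, or no longer terminal
        (nbr,) = neighbors.pop(atom)
        neighbors[nbr].discard(atom)
        if len(neighbors[nbr]) == 1:
            terminal.append(nbr)  # removing the atom exposed a new chain end

    # Atoms left without any neighbor belonged to a purely acyclic molecule.
    return {atom for atom, nbrs in neighbors.items() if nbrs}


# Two three-membered rings (0-1-2 and 4-5-6) joined through linker atom 3,
# which also carries a two-atom side chain (7-8).
example_bonds = [
    (0, 1), (1, 2), (2, 0),
    (2, 3), (3, 4),
    (4, 5), (5, 6), (6, 4),
    (3, 7), (7, 8),
]
scaffold_atoms = prune_side_chains(example_bonds)
```

In the example the side chain 7-8 is removed, while linker atom 3 stays because it connects the two rings. A molecule without rings prunes to an empty set.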

Page 17:

S-Cluster Algorithm (2)

• The complexity of a structure:

V_i = { sssrs, atoms, bonds, zs } for scaffold S_i

V_v = { sssrs, atoms, bonds, zs } from a reference scaffold

P_i = || V_v + V_i ||

M_i = || V_v - V_i ||

$$\mathrm{Similarity}(S_i) = \frac{P_i - M_i}{P_i}$$
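The complexity measure can be evaluated with a few lines of code. The sketch assumes the Euclidean norm for || . ||, which the slide does not specify, and uses a benzene-like ring as a made-up reference scaffold; all names are illustrative.

```python
import math


def euclidean_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def similarity(v_ref, v_i):
    """(P_i - M_i) / P_i with P_i = ||V_v + V_i|| and M_i = ||V_v - V_i||."""
    p = euclidean_norm([a + b for a, b in zip(v_ref, v_i)])
    m = euclidean_norm([a - b for a, b in zip(v_ref, v_i)])
    if p == 0:
        raise ValueError("both descriptor vectors are zero")
    return (p - m) / p


# Descriptor order: (sssrs, atoms, bonds, zs).
benzene_like = (1, 6, 6, 36)        # one ring, six carbons, six bonds
naphthalene_like = (2, 10, 11, 60)  # two fused rings, ten carbons

same = similarity(benzene_like, benzene_like)
different = similarity(benzene_like, naphthalene_like)
```

A scaffold identical to the reference scores exactly 1, and the score falls toward 0 as the descriptor vectors move apart, so one number places every scaffold on a common axis.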

Page 18:

S-Cluster Algorithm (3)

• The “Cyclicity” of a structure
– The sum of heavy atomic numbers (a)
– The number of rotatable bonds (r)
– The number of 1-degree nodes (d1)
– The number of double bonds (db)
– The number of triple bonds (tb)
– The number of 2-degree nodes (d2)
– V_s = { a, r, d1, db, tb, d2 } for the scaffold
– V_i = { a, r, d1, db, tb, d2 } for structure i

$$\mathrm{Cyclicity}(S_i) = \frac{\lVert V_s + V_i \rVert - \lVert V_s - V_i \rVert}{\lVert V_s + V_i \rVert}$$
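The cyclicity descriptors are simple counts on the hydrogen-suppressed graph, so they can be sketched without a cheminformatics toolkit. The code below is an illustration under stated assumptions: the Euclidean norm is assumed, the number of rotatable bonds is passed in rather than perceived, and the cyclohexane / methylcyclohexane example is made up.

```python
import math


def cyclicity_vector(atomic_numbers, bonds, rotatable_bonds):
    """Build (a, r, d1, db, tb, d2) for a hydrogen-suppressed graph.

    atomic_numbers: dict mapping atom id to atomic number
    bonds: list of (atom_a, atom_b, bond_order) tuples
    rotatable_bonds: count supplied by the caller (not perceived here)
    """
    degree = {atom: 0 for atom in atomic_numbers}
    for first, second, _order in bonds:
        degree[first] += 1
        degree[second] += 1
    return (
        sum(atomic_numbers.values()),
        rotatable_bonds,
        sum(1 for d in degree.values() if d == 1),
        sum(1 for _, _, order in bonds if order == 2),
        sum(1 for _, _, order in bonds if order == 3),
        sum(1 for d in degree.values() if d == 2),
    )


def euclidean_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def cyclicity(v_scaffold, v_structure):
    plus = euclidean_norm([s + i for s, i in zip(v_scaffold, v_structure)])
    minus = euclidean_norm([s - i for s, i in zip(v_scaffold, v_structure)])
    return (plus - minus) / plus if plus else 0.0


# Scaffold: a six-membered carbon ring. Structure: the same ring plus one methyl group.
ring_atoms = {i: 6 for i in range(6)}
ring_bonds = [(i, (i + 1) % 6, 1) for i in range(6)]
v_scaffold = cyclicity_vector(ring_atoms, ring_bonds, rotatable_bonds=0)

structure_atoms = dict(ring_atoms)
structure_atoms[6] = 6
structure_bonds = ring_bonds + [(0, 6, 1)]
v_structure = cyclicity_vector(structure_atoms, structure_bonds, rotatable_bonds=0)

value = cyclicity(v_scaffold, v_structure)
```

A structure that is all scaffold scores 1, and each added side-chain atom pulls the value down, which matches chain-like structures ending up at the low end of the cyclicity axis.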

Page 19:

Results and discussions

• Cluster the following libraries together:
– ACD (250,468 structures)
– NCI (126,554 structures, MDL 1994)
– CMC (4,591 oral drugs)
– MDDR (6,347 launched or pre-clinical drugs or compounds)

• Cluster all 387,960 structures on an NT laptop (Compaq Armada E700)

• Running time: 1 h 42 min

Page 20:

Cyclicity vs Complexity

Page 21:

The most complex structure is at the upper right

[Figure: chemical structure of the most complex compound]

Page 22:

The most chain-like structure is at the bottom left

[Figure: chemical structure of the most chain-like compound]

Page 23:

Zoom-in: Substituent Patterns

[Figure: zoomed-in plot with substituent-pattern regions labeled 1 to 7]

Page 24:

Table 1. Some Chemical Diversity Patterns (examples are from the ACD database)

Pattern #   Meaning
1           Single Li-substituted compounds
2           Single methyl-substituted compounds, sometimes ethylene-substituted compounds
3           Single primary amine compounds
4           Single carbonyl group or alcohol group compounds
5           Single fluoride-substituted compounds
6           Double methyl-substituted or single ethyl-substituted compounds
7           Single primary amine and single methyl-substituted compounds

[The Example column of the table contains one structure drawing per pattern.]

Page 25:


Page 26:

Diversity “Island” and “Density”

[Figure: plot with two marked regions, A and B]

A: Single O substituents

B: Single F substituents

Page 27:

“Cyclicity” vs Average Electronegativity

Page 28:

“Cyclicity” vs H-Bond Donors

Page 29:

Reagent Selector(R) clustering result (Jarvis-Patrick method):

Input: 116 compounds; 26 clusters requested

This is cluster 2

[Figure: structures of the compounds in cluster 2]

Page 30:

Result from the S-Cluster algorithm:

Input: 116 compounds; 26 clusters were found

This is cluster 2

[Figure: structures of the compounds in cluster 2]

Page 31:

Applications

• Evaluate libraries

• Compare libraries

• Design a focused library

Page 32:
Page 33:

[Figure legend] Blue: Virtual Library; Red: Target Library

Page 34:

The optimized sub-library to be made from the virtual library

Page 35:

But if you still want to cluster molecules (genes or small molecules) based on their property/activity arrays...

We have V-Cluster (Vector Cluster Algorithm) for this requirement; it will be presented later.

Page 36:

Conclusions

• We emphasize finding natural clusters

• There must be chemical/physical explanations for computational results

• Before a software “button” is pushed, the mathematical/chemical/physical/biological meaning should be understood

• A good algorithm should be robust

Page 37:

Acknowledgements

• Cheminformatics/Medicinal Chemistry
– Dr. Qiang Zhang
– Dr. Hans Briem
– Dr. Ron Magolda