371 Similarity

Similarity Methods

C371

Fall 2004

Limitations of Substructure Searching/3D Pharmacophore Searching

• Need to know what you are looking for

• Compound is either there or not– Don’t get a feel for the relative ranking of the

compounds

• Output size can be a problem

Similarity Searching

• Look for compounds that are most similar to the query compound

• Each compound in the database is ranked

• In other application areas, the technique is known as pattern matching or signature analysis

Similar Property Principle

• Structurally similar molecules usually have similar properties, e.g., biological activity

• Known also as “neighborhood behavior”

• Examples: morphine, codeine, heroin

• Define: in silico– Using computational techniques as a

substitute for or complement to experimental methods

Advantages of Similarity Searching

• One known active compound becomes the search key

• User sets the limits on output

• Possible to re-cycle the top answers to find other possibilities

• Subjective determination of the degree of similarity

Applications of Similarity Searching

• Evaluation of the uniqueness of proposed or newly synthesized compounds

• Finding starting materials or intermediates in synthesis design

• Handling of chemical reactions and mixtures

• Finding the right chemicals for one’s needs, even if not sure what is needed.

Subjective Nature of Similarity Searching

• No hard and fast rules

• Numerical descriptors are used to compare molecules

• A similarity coefficient is defined to quantify the degree of similarity

• Similarity and dissimilarity rankings can be different in principle

Similarity and Dissimilarity

“Consider two objects A and B, a is the number of features (characteristics) present in A and absent in B, b is the number of features absent in A and present in B, c is the number of features common to both objects, and d is the number of features absent from both objects. Thus, c and d measure the present and the absent matches, respectively, i.e., similarity; while a and b measure the corresponding mismatches, i.e., dissimilarity.” (Chemoinformatics; A Textbook (2003), p. 304)

2D Similarity Measures

• Commonly based on “fingerprints,” binary vectors with 1 indicating the presence of the fragment and 0 the absence

• Could relate structural keys, hashed fingerprints, or continuous data (e.g., topological indexes that take into acount size, degree of branching, and overall shape)

Tanimoto Coefficient

• Tanimoto Coefficient of similarity for Molecules A and B:

SAB = c _

a + b – ca = bits set to 1 in A, b = bits set to 1 in B, c =

number of 1 bits common to both

Range is 0 to 1.

Value of 1 does not mean the molecules are identical.

Similarity Coefficients

• Tanimoto coefficient is most widely used for binary fingerprints

• Others:– Dice coefficient– Cosine similarity– Euclidean distance– Hamming distance– Soergel distance

Distance Between Pairs of Molecules

• Used to define dissimilarity of molecules

• Regards a common absence of a feature as evidence of similarity

When is a distance coefficient a metric?

• Distance values must be zero or positive– Distance from an object to itself must be zero

• Distance values must be symmetric• Distance values must obey the triangle

inequality: DAB ≤ DAC + DBC

• Distance between non-identical objects must be greater than zero.

• Dissimilarity = distance in the n-dimensional descriptor space

Size Dependency of the Measures

• Small molecules often have lower similarity values using Tanimoto

• Tanimoto normalizes the degree of size in the denominator:

SAB = c _

a + b – c

Other 2D Descriptor Methods

• Similarity can be based on continuous whole molecule properties, e.g. logP, molar refractivity, topological indexes.

• Usual approach is to use a distance coefficient, such as Euclidean distance.

Maximum Common Subgraph Similarity

• Another approach: generate alignment between the molecules (mapping)

• Define MCS: largest set of atoms and bonds in common between the two structures.

• A Non-Polynomial- (NP)-complete problem: very computer intensive; in the worst case, the algorithm will have an exponential computational complexity

• Tricks are used to cut down on the computer usage

Maximum Common Subgraph

Reduced Graph Similarity

• A structure’s key features are condensed while retaining the connections between them

• Cen ID structures with similar binding characteristics, but different underlying skeletons

• Smaller number of nodes speeds up searching

3D Similarity

• Aim is often to identify structurally different molecules

• 3D methods require consideration of the conformational properties of molecules

Tanimoto Coefficient to Find Compounds Similar to Morphine

3D: Alignment-Independent Methods

• Descriptors: geometric atom pairs and their distances, valence and torsion angles, atom triplets

• Consideration of conformational flexibility increases greatly the compute time

• Relatively fewer pharmacophoric fingerprints than 2D fingerprints– Result: Low similarity values using Tanimoto

Pharmacophore

• A structural abstraction of the interactions between various functional group types in a compound

• Described by a spatial representation of these groups as centers (or vertices) of geometrical polyhedra, together with pairwise distances between centers

• http://www.ma.psu.edu/~csb15/pubs/searle.pdf

http://www.ma.psu.edu/~csb15/pubs/searle.pdf

http://www.ma.psu.edu/~csb15/pubs/searle.pdf

3D: Alignment Methods

• Require consideration of the degrees of freedom related to the conformational flexibility of the molecules

• Goal: determine the alignment where similarity measure is at a maximum

3D: Field-Based Alignment Methods

• Consideration of the electron density of the molecules– Requires quantum mechanical calculation:

costly– Property not sufficiently discriminatory

3D: Gnomonic Projection Methods

• Molecule positioned at the center of a sphere and properties projected on the surface

• Sphere approximated by a tessellated icosahedron or dodecahedron

• Each triangular face is divided into a series of smaller triangles

Finding the Optimal Alignment

• Need a mechanism for exploring the orientational (and conformational) degrees of freedon for determining the optimal alignment where the similarity is maximized

• Methods: simplex algorithm, Monte Carlo methods, genetic alrogithms

Evaluation of Similarity Methods

• Generally, 2D methods are more effective that 3D– 2D methods may be artificially enhanced

because of database characteristics (close analogs)

– Incomplete handling of conformational flexibility in 3D databases

• Best to use data fusion techniques, combining methods

For additional information . . .

• See Dr. John Barnard’s lecture at:http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt

http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt

371 Similarity

Documents