…ask more of your data
Bayesian Learning
• Build a model which estimates the likelihood that a given data sample is from a "good" subset of a larger set of samples (classification learning)
• SciTegic uses modified Naïve Bayesian statistics
  – Efficient:
    • scales linearly with large data sets
  – Robust:
    • works for a few as well as many ‘good’ examples
  – Unsupervised:
    • no tuning parameters needed
  – Multimodal:
    • can model broad classes of compounds
    • multiple modes of action represented in a single model
Learn Good from Bad
• “Learn Good from Bad” examines what distinguishes “good” from “baseline” compounds
  – Molecular properties (molecular weight, AlogP, etc.)
  – Molecular fingerprints
[Figure: example structures labeled Baseline and “Good”]
Learning: “Learn Good From Bad”
• User provides name for new component and a “Test for good”, e.g.:
  – Activity > 0.5
  – Conclusion EQ ‘CA’
• User specifies properties
  – Typical: fingerprints, AlogP, donors/acceptors, number of rotatable bonds, etc.
• Model is new component
• Component calculates a number
  – The larger the number, the more likely a sample is “good”
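As a rough illustration of how a component might turn a sample's features into a single number, the sketch below uses plain naive Bayes log-odds with add-one smoothing. This is an assumption for illustration only: the slides say the method is a modified naive Bayesian one but do not spell out the modification here. The names `train_model` and `score`, and the toy data, are hypothetical.

```python
import math
from collections import Counter

def train_model(samples, is_good):
    """samples: list of feature sets (e.g. fingerprint bits).
    is_good: parallel list of booleans, the outcome of the "Test for good"."""
    n_good = sum(is_good)
    n_base = len(samples) - n_good
    good_counts, base_counts = Counter(), Counter()
    for features, good in zip(samples, is_good):
        (good_counts if good else base_counts).update(features)
    weights = {}
    for f in set(good_counts) | set(base_counts):
        # Add-one smoothed estimates of P(feature | good) and P(feature | not good)
        p_good = (good_counts[f] + 1) / (n_good + 2)
        p_base = (base_counts[f] + 1) / (n_base + 2)
        weights[f] = math.log(p_good / p_base)
    return weights

def score(weights, features):
    # Simplification: only features present in the sample contribute;
    # features never seen in training contribute 0.
    return sum(weights.get(f, 0.0) for f in features)

# Toy data: letters stand in for fingerprint bits (made up for this sketch)
samples = [{"a", "b"}, {"a", "c"}, {"d"}, {"c", "d"}]
is_good = [True, True, False, False]
weights = train_model(samples, is_good)
print(score(weights, {"a", "b"}) > score(weights, {"d"}))  # True
```

Features seen mostly in "good" samples get positive weights, features seen mostly in baseline samples get negative weights, so a larger sum means the sample looks more like the "good" set.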
Using the model
• Model can be used to prioritize samples for screening, or search vendor libraries for new candidates for testing
• Quality of model can be evaluated:
  – Split data into training and test sets
  – Build model using training set
  – Sort test set using model value
  – Plot how rapidly hits are found in sorted list
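The evaluation steps above can be sketched as follows. This is a minimal sketch, not the product's own procedure: the helper names `split` and `enrichment_curve` are hypothetical, the scores are made up, and the curve is returned as a list of numbers rather than plotted.

```python
import random

def split(records, test_fraction=0.5, seed=0):
    """Shuffle a copy of the records and return (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def enrichment_curve(scores, is_hit):
    """Fraction of all hits found after each position of the list
    sorted by descending model value. Assumes at least one hit."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_hits = sum(is_hit)
    found, curve = 0, []
    for i in order:
        found += is_hit[i]
        curve.append(found / total_hits)
    return curve

# Toy test set: model values and whether each sample is a known hit
scores = [0.9, 0.1, 0.8, 0.3]
is_hit = [True, False, True, False]
curve = enrichment_curve(scores, is_hit)
print(curve)  # [0.5, 1.0, 1.0, 1.0]
```

A curve that reaches 1.0 well before the end of the list means the model ranks hits early; a model no better than random would climb roughly along the diagonal.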
Using a Learned Model
• Model appears on your tab in LearnedProperties
– Drag it into a protocol to use it “by value”
– Refer to it by name to use it “by reference”
Fingerprints
ECFP: Extended Connectivity Fingerprints
• New class of fingerprints for molecular characterization
  – Each bit represents the presence of a structural (not substructural) feature
  – 4 billion different bits
  – Multiple levels of abstraction contained in single FP
  – Different starting atom codes lead to different fingerprints (ECFP, FCFP, ...)
  – Typical molecule generates 100s - 1000s of bits
  – Typical library generates 100K - 10M different bits
Advantages
• Fast to calculate
• Represents much larger number of features
• Features not "pre-selected"
• Represents tertiary/quaternary information
  – As opposed to path-based fingerprints
• Bits can be “interpreted”
FCFP: Initial Atom Codes
[Figure: example structure with atoms labeled by their initial FCFP atom codes (values shown: 16, 0, 1, 3)]
FCFP atom code bits from:
  1: Has lone pairs
  2: Is H-bond donor
  4: Is negative ionizable
  8: Is positive ionizable
  16: Is aromatic
  32: Is halogen
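Because each property has its own power-of-two value, an atom's initial code is the sum (bitwise OR) of the properties it has. The sketch below shows that arithmetic with Python's `IntFlag`. The class and function names are hypothetical, and which properties a given atom has is assigned by hand here, not perceived from a structure.

```python
from enum import IntFlag

class AtomFeature(IntFlag):
    LONE_PAIRS = 1
    HBOND_DONOR = 2
    NEG_IONIZABLE = 4
    POS_IONIZABLE = 8
    AROMATIC = 16
    HALOGEN = 32

def atom_code(*features):
    """Combine the listed properties into one integer atom code."""
    code = AtomFeature(0)
    for feature in features:
        code |= feature
    return int(code)

def decode(code):
    """List the property names whose bits are set in an atom code."""
    return [f.name for f in AtomFeature if f & code]

print(atom_code(AtomFeature.AROMATIC))                            # 16
print(atom_code(AtomFeature.LONE_PAIRS, AtomFeature.HBOND_DONOR)) # 3
print(decode(3))   # ['LONE_PAIRS', 'HBOND_DONOR']
print(decode(0))   # []
```

Read this way, a code of 16 is an atom that is only aromatic, 3 is an atom with lone pairs that is also an H-bond donor, and 0 is an atom with none of the six properties.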
ECFP: Generating the Fingerprint
• Iteration is repeated the desired number of times
  – Each iteration extends the diameter by two bonds
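The iteration can be pictured with a simplified sketch: every atom starts from its initial code, and in each round an atom's identifier is replaced by a hash of its own identifier and those of its neighbors, so each round reaches one bond further out on every side. This is not the actual ECFP algorithm, which is not given in this section. The function name `ecfp_like`, the use of CRC32 as the hash, and atomic numbers as starting codes are all assumptions made for illustration, and bond orders are ignored.

```python
import zlib

def ecfp_like(initial_codes, neighbors, iterations=2):
    """initial_codes: {atom index: starting integer code}
    neighbors: {atom index: list of bonded atom indices}
    Returns the set of 32-bit feature identifiers from all rounds."""
    ids = dict(initial_codes)
    features = set(ids.values())          # round 0: the atoms themselves
    for level in range(1, iterations + 1):
        new_ids = {}
        for atom, own in ids.items():
            env = (level, own, tuple(sorted(ids[n] for n in neighbors[atom])))
            # CRC32 yields an unsigned 32-bit value, i.e. about 4 billion bins
            new_ids[atom] = zlib.crc32(repr(env).encode())
        ids = new_ids
        features |= set(ids.values())     # keep features from every round
    return features

# Ethanol heavy atoms C-C-O, with atomic numbers as hypothetical starting codes
codes = {0: 6, 1: 6, 2: 8}
bonds = {0: [1], 1: [0, 2], 2: [1]}
fp0 = ecfp_like(codes, bonds, iterations=0)
fp2 = ecfp_like(codes, bonds, iterations=2)
print(fp0)          # {8, 6}
print(fp0 <= fp2)   # True: later rounds add features, earlier ones are kept
```

Keeping the identifiers from every round is what puts several levels of abstraction into one fingerprint, and swapping the starting codes changes every identifier derived from them.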
• A histogram can visually show the separation of actives and nonactives using a model
Choosing a Cutoff Value: ROC Plots
• Originated in signal detection theory; widely used in clinical medicine
• Shows balance of costs of missing a true positive versus falsely accepting a negative
• Area under the curve is a measure of quality:
  – 0.90-1.00 = excellent (A)
  – 0.80-0.90 = good (B)
  – 0.70-0.80 = fair (C)
  – 0.60-0.70 = poor (D)
  – 0.50-0.60 = fail (F)
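For a concrete sense of the area under the curve, the sketch below computes it as the fraction of (active, inactive) pairs in which the active gets the higher model value, which is equivalent to the area under the ROC curve, and then maps it to the grades above. The function names and toy scores are hypothetical, and assigning boundary values such as exactly 0.90 to the higher grade is an assumption, since the table leaves that open.

```python
def roc_auc(scores, is_active):
    """Probability that a randomly chosen active outscores a randomly
    chosen inactive (ties count half). Needs at least one of each."""
    pos = [s for s, a in zip(scores, is_active) if a]
    neg = [s for s, a in zip(scores, is_active) if not a]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def grade(auc):
    """Map an AUC to the letter grades listed on the slide."""
    bands = [(0.90, "excellent (A)"), (0.80, "good (B)"), (0.70, "fair (C)"),
             (0.60, "poor (D)"), (0.50, "fail (F)")]
    for cutoff, label in bands:
        if auc >= cutoff:   # assumption: a boundary value gets the higher grade
            return label
    return "below 0.50 (not covered by the table)"

# Toy data: model values and known activity for six samples
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_active = [True, True, False, True, False, False]
auc = roc_auc(scores, is_active)
print(round(auc, 3), grade(auc))  # 0.889 good (B)
```

Here 8 of the 9 active/inactive pairs are ranked correctly, giving 8/9, which falls in the "good" band.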
ROC Plot for MAO
Postscript: non-FP Descriptors
• AlogP
  – A measure of the octanol/water partition coefficient
  – High value means molecule "prefers" to be in octanol rather than water, i.e., is nonpolar
  – A real number
• Molecular Weight
  – Total mass of all of the atoms making up the molecule
  – Units are atomic mass units (a.m.u.) in which the mass of each proton or neutron is approximately 1
  – A positive real number
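As a small worked example of the molecular weight descriptor, the sketch below sums standard atomic masses over a simple molecular formula. The function name, the four-element mass table, and the formula parser (no parentheses or charges) are assumptions made for illustration.

```python
import re

# Standard atomic masses in a.m.u. for a few common elements
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(formula):
    """Sum atomic masses over a flat formula such as 'C2H6O'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total

# Ethanol: 2 x 12.011 + 6 x 1.008 + 15.999
print(round(molecular_weight("C2H6O"), 3))  # 46.069
```

The result is close to the count of protons and neutrons in the molecule (46 for ethanol with the most common isotopes), which is the approximation the slide refers to.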
Postscript: non-FP Descriptors
• Num H Acceptors, Num H Donors
  – Molecules may link to each other via hydrogen bonds
  – H-bonds are weaker than true chemical bonds
  – H-bonds play a role in drug activity
  – H donors are polar atoms such as N and O with an attached H (can "donate" a hydrogen to form H-bond)
  – H acceptors are polar atoms lacking an attached H (can "accept" a hydrogen to form H-bond)
  – Num H Acceptors, Num H Donors are counts of atoms meeting the above criteria
  – Non-negative integers
Postscript: non-FP Descriptors
• Num Rotatable Bonds
  – Certain bonds between atoms are rigid
    • Bonds within rings
    • Double and triple bonds
  – Others are rotatable
    • Attached parts of molecule can freely pivot around bond
  – Num Rotatable Bonds is count of rotatable bonds in molecule
  – A non-negative integer
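The rule above can be sketched as a simple count over a bond list. This follows the simplified definition on the slide only. Cheminformatics toolkits typically apply further exclusions (for example, bonds to terminal atoms), so their counts can be lower. The `Bond` record and the hand-built molecules are hypothetical.

```python
from typing import NamedTuple

class Bond(NamedTuple):
    a: int          # index of first atom
    b: int          # index of second atom
    order: int      # 1 = single, 2 = double, 3 = triple
    in_ring: bool   # True if the bond is part of a ring

def num_rotatable_bonds(bonds):
    """Count bonds that are neither in a ring nor double/triple."""
    return sum(1 for bond in bonds if bond.order == 1 and not bond.in_ring)

# Heavy-atom skeletons, built by hand for this sketch
butane = [Bond(0, 1, 1, False), Bond(1, 2, 1, False), Bond(2, 3, 1, False)]
but_2_ene = [Bond(0, 1, 1, False), Bond(1, 2, 2, False), Bond(2, 3, 1, False)]
cyclopropane = [Bond(0, 1, 1, True), Bond(1, 2, 1, True), Bond(2, 0, 1, True)]

print(num_rotatable_bonds(butane))        # 3
print(num_rotatable_bonds(but_2_ene))     # 2
print(num_rotatable_bonds(cyclopropane))  # 0
```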