Automated Theory Formation in Bioinformatics Simon Colton Computational Bioinformatics Lab Imperial College, London

Jan 12, 2016


Page 1

Automated Theory Formation in Bioinformatics

Simon Colton
Computational Bioinformatics Lab
Imperial College, London

Page 2

Predictive Toxicology

Drug companies lose lots of money developing drugs which are toxic

Machine learning problem: given positive (+) and negative (-) examples, why are the positives toxic (active)?

Machine learning approaches: ILP, neural nets, linear regression, CART

Two more today: Automated Theory Formation and template search

Important: scientists want both predictive accuracy and scientific knowledge

Page 3

Automated Theory Formation (ATF) Questions

Given some background information
  Concepts, hypotheses (axioms)
And some objects of interest
  Numbers, molecules, etc.
Find something interesting
Interesting things could be:
  Concepts, examples, hypotheses, explanations

Page 4

ATF Overview

Scientific theories contain (at least):
  Concepts: salt, acid, base
  Hypotheses: acid + base => salt + water
  Explanations: transfer of electrons, dissolving

So, ATF should do (at least):
  Concept formation, conjecture making
  Hypothesis proving and disproving

Also needs to: measure interestingness, present results, etc.

Page 5

HR Theory Formation System

Developed in maths
  Designed to be a general-purpose system

Concept-based theory formation
  Tries to make a concept
  Makes a conjecture when it can’t make a concept
  Tries to explain conjectures

Measures of interestingness
  To direct a heuristic search

Page 6

Concept Formation in HR

10 general production rules
  Take in old concepts, produce new concepts

Example construction (diagram on the slide), starting from divisors:
  [a,b] : b|a
  Size:    [a,n] : n = |{b : b|a}|
  Split:   [a] : 2 = |{b : b|a}|
  Split:   [a] : 2|a
  Negate:  [a] : not 2|a
  Compose: [a] : 2 = |{b : b|a}| & not 2|a (Odd Prime Numbers)
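The construction above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: concepts are stored as sets of tuples over the integers 1 to 30, the function names `size`, `split`, `negate` and `compose` are ours, and `compose` is simplified to a plain intersection of same-arity concepts. HR's actual production rules are more general.

```python
# Hypothetical sketch of HR-style production rules over the integers 1..30.
# Concepts are sets of tuples; names and representation are our assumptions.

N = 30
numbers = range(1, N + 1)

# Background concept [a, b] : b | a
divisors = {(a, b) for a in numbers for b in numbers if a % b == 0}


def size(concept):
    """[a, n] : n = |{b : (a, b) in concept}|"""
    counts = {}
    for a, _ in concept:
        counts[a] = counts.get(a, 0) + 1
    return {(a, n) for a, n in counts.items()}


def split(concept, position, value):
    """Fix one column to a value, then drop that column."""
    return {t[:position] + t[position + 1:]
            for t in concept if t[position] == value}


def negate(concept, universe):
    """Objects of interest that do not satisfy the concept."""
    return universe - concept


def compose(c1, c2):
    """Simplified compose: conjunction of two same-arity concepts."""
    return c1 & c2


universe = {(a,) for a in numbers}
number_of_divisors = size(divisors)          # [a, n] : n = |{b : b|a}|
primes = split(number_of_divisors, 1, 2)     # [a] : 2 = |{b : b|a}|
evens = split(divisors, 1, 2)                # [a] : 2|a
odds = negate(evens, universe)               # [a] : not 2|a
odd_primes = compose(primes, odds)

print(sorted(a for (a,) in odd_primes))
```

Each step takes old concepts and returns a new one, so the odd primes are reached in five rule applications from the single background concept of divisors.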

Page 7

Conjecture Making

Empirical checks are performed
  After each attempt to invent a new concept

If the concept has no examples
  Makes a non-existence conjecture

If the concept has the same examples as a previous concept
  Makes an equivalence conjecture

If another concept subsumes the concept
  Makes an implication conjecture
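A minimal sketch of these three empirical checks, assuming each concept is summarised by the set of examples that satisfy it. The function `check_concept` and the toy concepts are hypothetical, not HR's own code.

```python
# Hypothetical sketch: compare a new concept's examples with known concepts.

def check_concept(name, examples, known):
    """Return a list of (kind, new_concept, other_concept) conjectures."""
    examples = frozenset(examples)
    if not examples:
        # No examples at all: conjecture that nothing satisfies the concept.
        return [("non-existence", name, None)]
    conjectures = []
    for other, other_examples in known.items():
        if examples == other_examples:
            conjectures.append(("equivalence", name, other))
        elif examples < other_examples:
            # Every example of the new concept is an example of the other one.
            conjectures.append(("implication", name, other))
    return conjectures


known = {
    "odd": frozenset({1, 3, 5, 7, 9}),
    "prime": frozenset({2, 3, 5, 7}),
}

print(check_concept("even_and_odd", set(), known))
print(check_concept("not_even", {1, 3, 5, 7, 9}, known))
print(check_concept("odd_prime", {3, 5, 7}, known))
```

The conjectures are only empirical: they hold for the examples supplied and still need proving or disproving.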

Page 8

Conjecture Extraction

Suppose HR makes the equivalence conjecture:
  P(a) & Q(a) <=> R(a) & S(a)

Extracts:
  P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
  R(a) & S(a) => P(a), R(a) & S(a) => Q(a)

Tries to extract: P(a) => R(a), Q(a) => R(a), etc.
  Prime implicates (require proving, though)

Important: gets Horn clauses
  Can be expressed in Prolog
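The extraction step can be illustrated with a short sketch. It assumes an equivalence conjecture is given as two lists of literal strings; the helper names are hypothetical. Candidate prime implicates are only generated here, not proved.

```python
# Hypothetical sketch: split an equivalence conjecture into Horn clauses.
from itertools import combinations


def extract_implications(lhs, rhs):
    """One clause (body, head) per literal on the opposite side."""
    clauses = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            clauses.append((tuple(body), head))
    return clauses


def candidate_prime_implicates(clause):
    """Clauses with a strictly smaller body; each still requires proving."""
    body, head = clause
    return [(subset, head)
            for r in range(1, len(body))
            for subset in combinations(body, r)]


def to_prolog(clause):
    """Render a (body, head) pair as a Prolog-style Horn clause."""
    body, head = clause
    return f"{head} :- {', '.join(body)}."


clauses = extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"])
for clause in clauses:
    print(to_prolog(clause))
print(candidate_prime_implicates(clauses[0]))
```

For the conjecture on the slide this yields the four implications listed, plus P(a) => R(a) and Q(a) => R(a) as candidates derived from the first clause.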

Page 9

Greatest Hits (in Maths)

Pre-processing constraint problems
Learning properties of residue classes
Inventing integer sequences
Puzzle generation
Adding to the TPTP library
Setting mathematical tutorial questions

See Springer book for details

Page 10

Long term aim in Bioinformatics

Develop an ATF system
  Working in biological domains

Biologist provides little background info
  In a format they are happy with

Program provides results
  Intelligent, interesting, not too much
  And very little rubbish

Page 11

Some short term aims in Bioinformatics

HR can work with biological data
  Takes input similar to Progol
Use HR to solve ML problems
  See how bad an idea that is
Use theory formation to improve ML
  Integrate HR and Progol somehow
Push the envelope
  Give biologists more information

Page 12

Approach to ML Tasks

Give HR the same input as Progol
  Get it to form a theory

Look at the theory
  Extract concepts which look similar to the target concept

Not a goal-based approach
  Bad idea (slow)

Implemented a reactive search
  Much faster

Page 13

Mutagenesis(42) Data

Mutagenesis is related to carcinogenesis
42 drugs supplied with atom-bond details
  Atom type, number & charge, bond type (1-8)
13 are mutagenic (active), 29 not active
Progol learned this concept (88% accurate):

active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E)

Substructure diagram: [c,21] -1- [?] -2- [?]
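To make the reading of Progol's clause concrete, here is a minimal sketch that checks it against a toy molecule. The data layout (a dictionary of atoms and a list of directed bonds), the atom identifiers and the charges are all hypothetical; only the clause itself comes from the slide, and bonds are matched in the direction they are written.

```python
# Hypothetical sketch: does Progol's learned clause fire on a molecule?
# atoms: {atom_id: (element, atom_type, charge)}
# bonds: [(atom_id_1, atom_id_2, bond_type)], matched as directed facts.

def progol_rule_fires(atoms, bonds):
    """active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E)."""
    for b, _c, first_type in bonds:
        if first_type != 2:
            continue                      # need bond(A,B,C,2)
        for d, b2, second_type in bonds:
            if second_type != 1 or b2 != b:
                continue                  # need bond(A,D,B,1) into the same B
            element, atom_type, _charge = atoms[d]
            if element == "c" and atom_type == 21:
                return True               # atm(A,D,c,21,E), any charge E
    return False


# Toy molecule containing the chain [c,21] -1- [n,38] -2- [o,40].
atoms = {
    "d1": ("c", 21, 0.1),
    "d2": ("n", 38, 0.8),
    "d3": ("o", 40, -0.4),
}
chain = [("d1", "d2", 1), ("d2", "d3", 2)]
other = [("d1", "d2", 7), ("d2", "d3", 7)]

print(progol_rule_fires(atoms, chain))
print(progol_rule_fires(atoms, other))
```

The rule constrains only the first atom's element and type and the two bond types, which is why two positions in the diagram are left as "?".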

Page 14

HR’s Results

Using reactive search, four production rules (PRs), 30K steps
HR learned these concepts:

active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
active(A) :- bond(A,B,C,D), atm(B,E,21), atm(C,F,38)
active(A) :- \+ (bond(A,B,C,D), atm(B,F,21), bond(A,C,D,E))

Also 88% accurate
But Progol’s answer is “better”
  Because of higher information content (fewer ?s)
  Biologists sometimes want more information

Substructure diagram: [?,21] -1- [?] -?- [?]

Page 15

But…..

HR also made these equivalence conjectures
  And extracted them (+100 more) for us

atm(B,X1,21) <=> atm(B,c,21)
atm(B,X1,38) <=> atm(B,n,38)
bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)

We used these to rewrite HR’s answer
  By hand, but hope to automate

Page 16

Giving us this answer:

[c,21] -1- [n,38] -2- [?]

Remember that Progol’s answer was:

[c,21] -1- [?] -2- [?]

So, we filled in one of the blanks!

Page 17

Are we making a meal of this?

Yes, possibly for the mutagenesis data
  I was worried about the difficulty of this problem

In the last fortnight: “template search”
  200-line Prolog program
  And can be distributed over multiple processors
  And can be easily understood by biologists

And gets these results…

Page 18

Template search - Results

More specific substructure found
(88% accurate on 42, 88% cross validation):
  [c,21] -1- [n,38] -2- [o,40], plus a second [o,40] on a type-2 bond

More general substructure found (also 88% accurate):
  [c,21] -1- [?]

Page 19

Template Search - Assumptions

Connected substructures
  Are interesting answers
  Progol’s answers are all substructures

More specific substructures are OK
  Biologists may even want lots of information
  Don’t forget that they want to do science

Each learned concept will be true of
  At least one active (positive) molecule

Page 20

Template Search - Overview

User specifies:
  Template for substructures
  How general the solution can become (IC limit)
  Example: 3 ?s allowed in this template:

  [?,?] -?- [?,?] -?- [?,?]

Mitchell: FIND-S routine (very simple)
Algorithm starts with the first positive
  And extracts all the substructures (in template)
Then takes the next positive and for each substructure
  It finds the least general generalisation
  So the new substructure is true of both +ves
  Do not over-generalise (IC limit)
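The generalisation loop can be sketched as follows. This is not the 200-line Prolog program from the talk; it is a small Python illustration under stated assumptions: each substructure is an 8-tuple (element, type, element, type, element, type, bond, bond) for a three-in-a-row template, "_" marks a generalised position, and a hypothesis is kept unchanged when no generalisation stays within the IC limit. That last choice is our own, made so that every result remains true of at least one positive. The example molecules are invented.

```python
# Hypothetical sketch of a FIND-S style template search.
WILD = "_"


def lgg(s1, s2):
    """Least general generalisation: keep agreeing positions, blank the rest."""
    return tuple(x if x == y else WILD for x, y in zip(s1, s2))


def wildcards(substructure):
    """Number of generalised positions (compared against the IC limit)."""
    return sum(1 for x in substructure if x == WILD)


def template_search(positives, ic_limit):
    """positives: one set of template-shaped substructures per active molecule."""
    hypotheses = set(positives[0])
    for substructures in positives[1:]:
        updated = set()
        for h in hypotheses:
            generalised = {lgg(h, s) for s in substructures}
            allowed = {g for g in generalised if wildcards(g) <= ic_limit}
            # Assumption: keep h as it is if every generalisation is too general.
            updated |= allowed if allowed else {h}
        hypotheses = updated
    return hypotheses


# Invented substructures for three positive molecules.
p1 = {("c", 21, "n", 38, "o", 40, 1, 2)}
p2 = {("c", 21, "n", 38, "o", 40, 1, 2), ("c", 22, "c", 22, "h", 3, 7, 1)}
p3 = {("c", 21, "c", 22, "o", 40, 1, 2)}

result = template_search([p1, p2, p3], ic_limit=3)
print(result)
```

With an IC limit of 3, the pairing that would blank seven positions is rejected, while the pairing that blanks only the middle atom is accepted.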

Page 21

Using the results

Procedure finds many results
  Ranging from specific to general

So, user must be advised on usage
  Take the most specific best
  Take the most general best
  Take a disjunction of all best answers
  Take a more intelligent disjunction

Cross validation results required
  To tell user predictive accuracy
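One way to compare these usage options is to score a single template, or a disjunction of templates, against labelled molecules. The sketch below assumes the same 8-tuple substructure encoding with "_" as a wildcard; the four toy molecules are invented and the score is plain accuracy on the data given, not a cross-validation estimate.

```python
# Hypothetical sketch: accuracy of a template or a disjunction of templates.

def matches(template, substructure):
    """A template matches when every non-wildcard position agrees."""
    return all(t == "_" or t == s for t, s in zip(template, substructure))


def predicts_active(templates, molecule_substructures):
    """Disjunction: active if any template matches any substructure."""
    return any(matches(t, s)
               for t in templates
               for s in molecule_substructures)


def accuracy(templates, dataset):
    """dataset: list of (set_of_substructures, is_active) pairs."""
    correct = sum(predicts_active(templates, subs) == active
                  for subs, active in dataset)
    return correct / len(dataset)


# Invented data: two active and two inactive molecules.
dataset = [
    ({("c", 21, "n", 38, "o", 40, 1, 2)}, True),
    ({("c", 195, "c", 22, "h", 3, 7, 1)}, True),
    ({("c", 22, "c", 22, "h", 3, 7, 1)}, False),
    ({("c", 21, "c", 22, "h", 3, 1, 1)}, False),
]

specific = ("c", 21, "n", 38, "o", 40, 1, 2)
general = ("c", 21, "_", "_", "_", "_", 1, "_")
second = ("c", 195, "c", 22, "h", 3, 7, 1)

print(accuracy([specific], dataset))
print(accuracy([general], dataset))
print(accuracy([specific, second], dataset))
```

On this toy data the disjunction covers both actives, while the general template also fires on an inactive molecule; real choices would be made on held-out folds.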

Page 22

More Results

Three in a row template (2 minutes)
6 answers with 88% over the 42 examples:
  [c,21,_,_,o,_,_,_]
  [c,21,_,_,_,_,1,_]
  [c,21,_,_,_,_,1,2]
  [c,21,_,_,o,_,1,2]
  [_,_,c,21,n,38,7,1]
  [c,21,n,38,o,40,1,2]

Take most general/specific: 88% 1-fold cross-val
Take disjunction of all: 88% cross-validation
Take more intelligent disjunction
(95% accurate on 42, 80% cross validation):

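[Diagram: three substructures in the disjunction. The first is [c,21] -1- [n,38] -2- [o,40]. The second is built from atoms c,?, c,22 and ?, with the numeric label -0.132. The third is built from atoms c,195, c,22 and h,3, with the numeric label 0.145. The remaining bond labels (1, 7, 7, 1) cannot be assigned from the transcript.]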

Page 23

Conclusions & Future Work

Automated Theory Formation
  May be useful to bioinformatics
  Use HR’s theory to improve Progol’s results
    • Possibly by pre-processing Progol’s input
    • Or by post-processing the learned concept

Template search
  Maybe a good idea?
  Simple, push envelope
  Nice results for the Mutagenesis(42) dataset
  Distribute the process
    • Processor per Positive (PPP)