Pierre Vermaak UCT

Dec 14, 2015

Transcript
Page 1: Pierre Vermaak UCT.  An attempt to automate the discovery of initial solution candidates.  Example-based learning  Why? ◦ Track record on difficult.

Pierre Vermaak, UCT

Page 2:

An attempt to automate the discovery of initial solution candidates.

Example-based learning. Why?

◦ Track record on difficult problems
◦ Very different to ∆²-based approaches; complementary
◦ Neat

In this talk, I’ll give a practical perspective

Page 3:

What is it?
◦ Very broad field
◦ Data mining
◦ Machine learning

Well-known algorithms
◦ Neural Networks
◦ Tree inducers (J48, M5P)
◦ Support Vector Machines
◦ Nearest Neighbour

Same idea throughout ...

Page 4:

Attempt to map input to output
◦ e.g. binary lens light curve -> model parameters

Uses an example data set: the "training set"
◦ e.g. many simulated curves and their model parameters

Adjusts learning model parameters to best fit the training data: "training"
◦ Usually some sort of iteration
◦ Algorithm dependent

Evaluation
◦ Performance is usually measured on an unseen data set: the "test set".
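The train/test protocol above can be sketched with a toy learner. The 1-nearest-neighbour "learner" here simply memorises the training set; the class and method names are illustrative, not the talk's actual code.

```java
// Minimal sketch of the training/test protocol: a 1-nearest-neighbour
// learner memorises the training set, then is scored on unseen test data.
public class TrainTestSketch {
    // Return the label of the training example closest to the query point.
    static String nearest(double[][] trainX, String[] trainY, double[] q) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < trainX.length; i++) {
            double d = 0;
            for (int j = 0; j < q.length; j++) {
                double diff = trainX[i][j] - q[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return trainY[best];
    }

    // "Evaluation": fraction of test examples classified correctly.
    static double accuracy(double[][] trainX, String[] trainY,
                           double[][] testX, String[] testY) {
        int hits = 0;
        for (int i = 0; i < testX.length; i++)
            if (nearest(trainX, trainY, testX[i]).equals(testY[i])) hits++;
        return (double) hits / testX.length;
    }
}
```

The key point is the separation: parameters are fitted (or, here, memorised) on the training set, while the reported number comes only from the test set.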

Page 5:

Famous data set by Fisher. Want to classify irises into three categories based on petal and sepal width and length. 150 examples.

Data Snippet:

sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
5                  2                 3.5                1                 Iris-versicolor
6                  2.2               4                  1                 Iris-versicolor
6.2                2.2               4.5                1.5               Iris-versicolor
6                  2.2               5                  1.5               Iris-virginica
4.5                2.3               1.3                0.3               Iris-setosa

Page 6:
Page 7:

"OneR"
◦ Deduces a rule based on one input column (an "attribute")

Results

Rule
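The OneR idea can be sketched in a few lines. For each attribute, map each of its values to the majority class seen with it, and keep the attribute whose rule makes the fewest mistakes on the training data. This sketch handles nominal attributes only (WEKA's version also discretizes numeric ones); the tiny data set in the test is made up for illustration.

```java
import java.util.*;

// Sketch of OneR: one rule per attribute, keep the least-wrong one.
public class OneRSketch {
    // data[i] = attribute values of example i; labels[i] = its class.
    // Returns the index of the winning attribute.
    static int bestAttribute(String[][] data, String[] labels) {
        int bestAttr = 0, bestErrors = Integer.MAX_VALUE;
        int nAttrs = data[0].length;
        for (int a = 0; a < nAttrs; a++) {
            // Count class frequencies for each value of attribute a.
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i < data.length; i++)
                counts.computeIfAbsent(data[i][a], k -> new HashMap<>())
                      .merge(labels[i], 1, Integer::sum);
            // Errors = examples not covered by the majority class of their value.
            int errors = 0;
            for (Map<String, Integer> byClass : counts.values()) {
                int total = 0, max = 0;
                for (int c : byClass.values()) { total += c; max = Math.max(max, c); }
                errors += total - max;
            }
            if (errors < bestErrors) { bestErrors = errors; bestAttr = a; }
        }
        return bestAttr;
    }
}
```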

Page 8:

"Multi-layer Perceptron"
◦ A simple Neural Network

Result

Network
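The forward pass of such a network can be sketched as follows. The weights here are fixed placeholders purely for illustration; an actual perceptron learns them by iterative training (e.g. backpropagation), which is omitted.

```java
// Sketch of a multi-layer perceptron's forward pass:
// one hidden layer of sigmoid units feeding a single sigmoid output.
public class MlpSketch {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // w1[h][j]: weight from input j to hidden unit h (last column = bias);
    // w2[h]: weight from hidden unit h to the output (last entry = bias).
    static double forward(double[] input, double[][] w1, double[] w2) {
        double[] hidden = new double[w1.length];
        for (int h = 0; h < w1.length; h++) {
            double s = w1[h][input.length];              // bias term
            for (int j = 0; j < input.length; j++) s += w1[h][j] * input[j];
            hidden[h] = sigmoid(s);
        }
        double out = w2[w2.length - 1];                  // output bias
        for (int h = 0; h < hidden.length; h++) out += w2[h] * hidden[h];
        return sigmoid(out);
    }
}
```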

Page 9:

Neat examples ... Can it be used on the real problem?

Issues
◦ Many ambiguities of the binary model
◦ No uniform input - not uniformly sampled
◦ Noise
◦ Complexity

Page 10:

Success/failure with a variety of approaches

The approach I'd like to take

DIY tools for the job
◦ Do try this at home.

Page 11:

"Raw" light curves are unsuitable

Require uniform inputs for training
◦ And the same scheme needs to be applied to subsequent unseen curves

Interpolation - non-trivial
◦ Which scheme? What biases are introduced?

Smoothing - non-trivial
◦ Required for interpolation anyway
◦ Also for derived features (extrema, slope)

Centering/Scaling - non-trivial
◦ Algorithms performed much better with normalized light curves
◦ What to centre on? The peak? Which one?
◦ What baseline? Real curves are truncated
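The uniform-input requirement can be sketched with the simplest possible interpolation scheme: resampling an unevenly sampled curve onto an evenly spaced time grid by linear interpolation. As noted above, the choice of scheme introduces biases; linear interpolation is only the baseline option, and the names here are illustrative.

```java
// Sketch: resample an unevenly sampled light curve onto a uniform time grid
// by linear interpolation (assumes t strictly increasing and n >= 2).
public class ResampleSketch {
    // Returns n magnitudes at evenly spaced times spanning [t[0], t[last]].
    static double[] resample(double[] t, double[] mag, int n) {
        double[] out = new double[n];
        double t0 = t[0], t1 = t[t.length - 1];
        int k = 0;                       // index of segment [t[k], t[k+1]]
        for (int i = 0; i < n; i++) {
            double ti = t0 + (t1 - t0) * i / (n - 1);
            while (k < t.length - 2 && t[k + 1] < ti) k++;
            double frac = (ti - t[k]) / (t[k + 1] - t[k]);
            out[i] = mag[k] + frac * (mag[k + 1] - mag[k]);
        }
        return out;
    }
}
```

The same resampling must be applied, with identical parameters, to every subsequent unseen curve, or the trained model sees inputs from a different distribution.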

Page 12:

How many example curves to use?
Ranges of binary lens model parameters in the training set
Noise model for example curves
Choice of learning algorithm
Pre-processing parameters, etc.

Page 13:

Normalized Curves
◦ Using truncation/centering/scaling and smoothing

Derived Features
◦ Attempt to extract properties of a light curve
◦ PCA
◦ polynomial fits
◦ extrema
◦ etc.

Page 14:

Various schemes attempted. Most successful:
◦ Find the time corresponding to peak brightness
◦ Translate the curve in time to this value
◦ Discard all data fainter (by magnitude) than 20% of the total magnitude range
◦ Normalize the time axis (-0.5 to 0.5)
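The recipe above can be sketched directly. "Fainter" is taken in the astronomical sense (larger magnitude = fainter), and the 20% cut is implemented as one plausible reading of the slide - dropping the faintest fifth of the magnitude range; the class and method names are illustrative.

```java
import java.util.*;

// Sketch of the centering/truncation/scaling recipe.
// Assumes t is sorted ascending and that more than one point survives the cut.
public class NormalizeSketch {
    // Returns {times, mags}: peak translated to t = 0, faint wings dropped,
    // surviving time span rescaled to [-0.5, 0.5].
    static double[][] normalize(double[] t, double[] mag) {
        // 1. Time of peak brightness = smallest magnitude.
        int peak = 0;
        for (int i = 1; i < mag.length; i++) if (mag[i] < mag[peak]) peak = i;
        double tPeak = t[peak];
        // 2. Discard the faintest 20% of the magnitude range.
        double min = mag[peak], max = mag[0];
        for (double m : mag) max = Math.max(max, m);
        double cut = min + 0.8 * (max - min);
        List<Double> ts = new ArrayList<>(), ms = new ArrayList<>();
        for (int i = 0; i < t.length; i++)
            if (mag[i] <= cut) { ts.add(t[i] - tPeak); ms.add(mag[i]); }
        // 3. Rescale surviving times to [-0.5, 0.5].
        double lo = ts.get(0), hi = ts.get(ts.size() - 1);
        double[][] out = new double[2][ts.size()];
        for (int i = 0; i < ts.size(); i++) {
            out[0][i] = (ts.get(i) - lo) / (hi - lo) - 0.5;
            out[1][i] = ms.get(i);
        }
        return out;
    }
}
```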

Page 15:

Required for interpolation of equally-spaced data points on the curve

Too much smoothing destroys features. Too little smoothing turns noise into features.

Final scheme was a fitted B-spline iteration:
◦ Fit a B-spline
◦ Count extrema
◦ Repeat until the number of extrema is in a suitable range
◦ Worked out to be surprisingly robust
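The iteration can be sketched with a moving average standing in for the B-spline fit (a full spline fitter is beyond a slide-sized example, so this only illustrates the smooth/count/repeat control loop, not the actual smoother used):

```java
// Sketch: smooth, count extrema, widen the smoothing until the count is sensible.
public class SmoothSketch {
    // Moving average with window half-width `half` (window shrinks at edges).
    static double[] movingAverage(double[] y, int half) {
        double[] out = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            int lo = Math.max(0, i - half), hi = Math.min(y.length - 1, i + half);
            double s = 0;
            for (int j = lo; j <= hi; j++) s += y[j];
            out[i] = s / (hi - lo + 1);
        }
        return out;
    }

    // Count interior local minima and maxima.
    static int countExtrema(double[] y) {
        int n = 0;
        for (int i = 1; i < y.length - 1; i++)
            if ((y[i] - y[i - 1]) * (y[i + 1] - y[i]) < 0) n++;
        return n;
    }

    // Widen the window until at most maxExtrema remain.
    static double[] smoothUntil(double[] y, int maxExtrema) {
        double[] s = y;
        for (int half = 1; countExtrema(s) > maxExtrema && half < y.length; half++)
            s = movingAverage(y, half);
        return s;
    }
}
```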

Page 16:

Truncation
◦ Slope-based ... numerical derivatives too noisy
◦ Fitting a simpler model (Gaussian, single-lens)
◦ Brightness exceeds 3 standard deviations of wing brightness

Smoothing
◦ Moving window averaging - destroys small features
◦ Savitzky-Golay - only works on evenly-spaced points
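The 3-sigma truncation idea can be sketched as follows: estimate the quiet "wing" level from the first and last few points, then keep only points brighter than the wing level by more than three standard deviations. Brighter means numerically smaller magnitude; `wingSize` and the class name are illustrative.

```java
// Sketch of 3-sigma truncation against the wing baseline.
public class TruncateSketch {
    // Keep only points significantly brighter than the wings.
    static double[] keepEvent(double[] mag, int wingSize) {
        double sum = 0, sumSq = 0;
        int n = 2 * wingSize;
        for (int i = 0; i < wingSize; i++) {            // both wings
            sum += mag[i] + mag[mag.length - 1 - i];
            sumSq += mag[i] * mag[i]
                   + mag[mag.length - 1 - i] * mag[mag.length - 1 - i];
        }
        double mean = sum / n;
        double sd = Math.sqrt(Math.max(0, sumSq / n - mean * mean));
        return java.util.Arrays.stream(mag)
                .filter(m -> mean - m > 3 * sd)         // brighter than baseline
                .toArray();
    }
}
```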

Page 17:

Chebyshev Polynomials
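Chebyshev polynomials are one of the "polynomial fits" mentioned among the derived features: fitting a curve against the basis T₀, T₁, ... yields a short, uniform coefficient vector. The standard recurrence T₀(x) = 1, T₁(x) = x, Tₙ₊₁(x) = 2x·Tₙ(x) - Tₙ₋₁(x) evaluates the basis; the fitting step itself is not shown.

```java
// Sketch: evaluate the Chebyshev polynomial T_n(x) by the usual recurrence.
public class ChebyshevSketch {
    static double T(int n, double x) {
        if (n == 0) return 1;
        if (n == 1) return x;
        double prev = 1, cur = x;
        for (int k = 2; k <= n; k++) {
            double next = 2 * x * cur - prev;  // T_{k} = 2x T_{k-1} - T_{k-2}
            prev = cur;
            cur = next;
        }
        return cur;
    }
}
```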

Page 18:

PCA

Page 19:

Single lens fits
Moments
Derivatives
Smoothed Curves
Time and Magnitude of extrema

Features are then selected for usefulness using selection algorithms (brute-force, information-based, etc.)
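Brute-force selection is the simplest of those algorithms: score every subset of a small feature set and keep the best. The scoring function (e.g. cross-validated accuracy of some learner) is left abstract here; everything in this sketch is illustrative.

```java
import java.util.function.IntToDoubleFunction;

// Sketch of brute-force feature-subset selection over a small feature set.
public class FeatureSelectSketch {
    // Each subset is a bitmask: bit i set means feature i is included.
    // Enumerates all 2^n - 1 non-empty subsets, so nFeatures must be small.
    static int bestSubset(int nFeatures, IntToDoubleFunction score) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int mask = 1; mask < (1 << nFeatures); mask++) {
            double s = score.applyAsDouble(mask);
            if (s > bestScore) { bestScore = s; best = mask; }
        }
        return best;
    }
}
```

Information-based selectors avoid this exponential enumeration by ranking features individually, at the cost of missing interactions between features.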

Page 20:

Using simulated curves

Page 21:
Page 22:
Page 23:

The pre-processed curves themselves performed slightly better than derived features.

A simple learning algorithm performed best (nearest neighbour).

It sort of works on real events, but not at production strength, and still with intervention.

Still required Genetic Algorithm fine-tuning. Not good at finding multiple solutions.

Page 24:

Automation: mimic a human expert
Categorize curves instantly
Use categorization to come up with a joint likelihood distribution in model parameter space

I want multiple solutions and large regions of exclusion.

Page 25:

Still believe in feature selection

Eliminate dodgy pre-processing
◦ Smoothing
◦ Interpolation

Use fast fits of "basis" functions
◦ Possibly use binary curves themselves for comparison, but with a robust distance metric
◦ Use the quality of fits as the main feature
◦ Fit a single lens and characterize residuals
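One possible robust distance between two equally sampled curves is the median absolute difference: unlike a sum of squares, a few outlying points cannot dominate it. This is an illustrative choice, not the metric the talk settles on.

```java
import java.util.Arrays;

// Sketch of a robust curve-to-curve distance: the median absolute difference.
public class RobustDistanceSketch {
    static double medianAbsDiff(double[] a, double[] b) {
        double[] d = new double[a.length];
        for (int i = 0; i < a.length; i++) d[i] = Math.abs(a[i] - b[i]);
        Arrays.sort(d);
        int n = d.length;
        return n % 2 == 1 ? d[n / 2] : 0.5 * (d[n / 2 - 1] + d[n / 2]);
    }
}
```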

Page 26:

These algorithms are very powerful. But no algorithm is any good against impossible odds. So alternative parameterizations, etc. are extremely important to this approach, just as in traditional fitting.

Page 27:

Java
◦ 60%-100% as fast as C++ nowadays
◦ Cross-platform
◦ Plugs into and out of everything (Python, legacy COM, Matlab, etc.)
◦ Oh, the tools! - parallelisation, IDEs, just everything.

"javalens" - my rather humble new Java code
◦ Asada's method
◦ Lots of abstraction, more like a framework
◦ Open Source
◦ Search "javalens" on Google Code

Page 28:

R
◦ Awesome, free and open source statistics environment
◦ Can be called from Java

WEKA
◦ Great data mining app, used extensively in my thesis
◦ Dangerous! Can spend years playing with it.
◦ Make sure you concentrate on the sensibility of your data
◦ NOT the large variety of fitting algorithms

Netbeans
◦ Just a great free, open source Java IDE
◦ Code completion
◦ Automatic refactoring tools

VI
◦ No comment