Machine Learning for Scientific Discovery
Cheng Soon Ong
Machine Learning Research Group, Data61 | CSIRO, Canberra
25 November 2016, Faculté Informatique et Communications, EPFL

May 23, 2020


Data61

NICTA merger

Part of CSIRO, focus on ICT

Approx 1000 researchers, PhD students and university staff

Applications - Optimization - Models

[Figure: 2-mer frequency features]

Mathematical Model

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$$

Numerical Optimization

Prediction Task
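To make the chain from mathematical model to numerical optimization to prediction concrete, here is a minimal sketch in plain Python. It assumes the usual extra constraint ξ_i ≥ 0 (not shown in the slide text), so each slack takes its smallest feasible value max(0, 1 − y_i(⟨w, x_i⟩ + b)). The toy data, step size, iteration count and function names are illustrative assumptions, not part of the talk.

```python
def svm_objective(w, b, X, y, C):
    """0.5*||w||^2 + C*sum_i xi_i, with xi_i = max(0, 1 - y_i(<w, x_i> + b))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slack = sum(
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    )
    return reg + C * slack


def subgradient_step(w, b, X, y, C, lr):
    """One subgradient descent step on the objective above."""
    gw = list(w)  # gradient of the regulariser 0.5*||w||^2 is w
    gb = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        if margin < 1.0:  # only margin violators contribute to the hinge term
            gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
            gb -= C * yi
    return [wj - lr * g for wj, g in zip(w, gw)], b - lr * gb


# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w, b = [0.0, 0.0], 0.0
start = svm_objective(w, b, X, y, C=1.0)
for _ in range(200):
    w, b = subgradient_step(w, b, X, y, C=1.0, lr=0.01)
final = svm_objective(w, b, X, y, C=1.0)

# Prediction task: sign of the learned linear function.
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
```

A fixed-step subgradient method is only one of many ways to solve this problem; it is used here because it fits in a few lines.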

Machine Learning and Science

What is machine learning?

Machine learning is about prediction

Examples/features: x1, . . . , xn ∼ X
Labels/annotations: y1, . . . , yn ∼ Y
Predictor: fw(x) : X → Y

Estimate best predictor = training
Given data (x1, y1), . . . , (xn, yn), find a predictor fw(·).

No mechanistic model of the phenomenon
There are relatively large amounts of data (examples, x usually in R^d)
The outcomes (labels, y usually binary) are well defined

Prediction ≠ understanding
How can we use prediction to help with scientific research?

Today: focus on the predictor

fw(x) : X→ Y

Label: Finding black holes

Physical models exist; we directly use images
There are relatively large amounts of data (examples)
Object localisation with crowd labels

Feature: Finding genetic associations

No mechanistic model of the phenomenon
High dimensional, low sample size
Stability of feature selection

Predictor: Finding good experiments

Partial mechanistic model of the phenomenon
Estimate the expected information gain

Discuss challenges to applying machine learning

Not standard binary classification

fw(x) : X→ Y

Finding black holes

Goal: Automate radio cross-identification, a problem in astronomy

Too much data

Collaboration with ANU, ANTF, CAASTRO
Square Kilometre Array (South Africa and Australia)

Labelled by non-experts

Convert object localisation to binary classification
Deal with label noise

Radio cross-identification

Optical Infrared X-ray Radio

Images of Centaurus A at different wavelengths.

The real data

The same patch of sky in both radio (left) and infrared (right)

Localisation as binary classification

Galaxy catalogue as candidates
Could scan a patch across the sky

Classify pairs of images

[Figure: example image pairs labelled positive and negative]

Features: neural network image features, fluxes, radial distance
https://github.com/chengsoonong/crowdastro

Crowdsourcing labels

Radio Galaxy Zoo: citizen science project to cross-identify radio galaxies

About 100,000 of 177,000 image pairs labelled.

5 volunteers per pair for compact sources
20 volunteers per pair for complex sources

How to find black holes

Prior catalogues

Heuristic rules + expert human effort (Norris et al. 2006)

Annotation based on physical models (Fan et al. 2015)

Use set where both agree as gold standard

Many labels to one binary label

Logistic regression from sklearn
Majority vote
EM style algorithm to estimate ground truth (Raykar et al. 2010, Yan et al. 2010)

Latent variable model

Noisy labels = ground truth + biased coin flip
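The "ground truth + biased coin flip" view can be simulated in a few lines. The sketch below assumes a deliberately simplified noise model: every volunteer flips the true label independently with the same probability, regardless of class. The flip probability of 0.3, the sample size and all function names are hypothetical; only the choice of 5 volunteers per item follows the Radio Galaxy Zoo setting for compact sources. It implements majority vote only, not the EM style estimators cited above.

```python
import random


def simulate_votes(truth, n_volunteers, flip_prob, rng):
    """Each volunteer reports the true label, flipped with probability flip_prob."""
    return [
        [1 - t if rng.random() < flip_prob else t for _ in range(n_volunteers)]
        for t in truth
    ]


def majority_vote(votes):
    """Collapse many binary labels per item into one binary label."""
    return [1 if 2 * sum(v) > len(v) else 0 for v in votes]


rng = random.Random(0)
truth = [rng.randint(0, 1) for _ in range(1000)]  # hidden ground truth
votes = simulate_votes(truth, n_volunteers=5, flip_prob=0.3, rng=rng)
aggregated = majority_vote(votes)

# Accuracy of a single volunteer versus the aggregated label.
single_acc = sum(v[0] == t for v, t in zip(votes, truth)) / len(truth)
majority_acc = sum(a == t for a, t in zip(aggregated, truth)) / len(truth)
```

Under these assumptions the aggregated label is noticeably more accurate than any single volunteer, which is the motivation for collecting several labels per image pair.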

Results

[Figure: bar chart of balanced accuracy (%), axis from 70 to 100, for LR(Norris), LR(Fan), LR(RGZ-MV), Raykar(RGZ-Top-50) and RGZ-Raw-MV.]

Conclusion: Features meaningful, but pipeline can be improved.

Side note about label noise

Latent variable
Assume that there is a hidden ground truth label, and model it.
Alger, Banfield, Ong (in preparation)

Learning with label noise
During training, pretend that labels are noiseless, and assume that the learning algorithm takes care of it.
Menon, van Rooyen, Ong, Williamson, ICML 2015

Model evaluation
How do we measure performance without ground truth?

What are good features?

fw(x) : X→ Y

Genome wide association study

Case-control studies
A cohort of sick individuals (cases) and healthy individuals (controls) are genotyped and their corresponding binary phenotype is recorded.

We use the framework of hypothesis testing

Hypothesis testing: Given a case-control study, test whether a particular SNP is associated with the phenotype.

Good biomarker? If the difference is statistically significant ⟹ SNP is associated with the phenotype.

bioinformatics.research.nicta.com.au/software/gwis/

Epistatic Interactions

Genome Wide Interaction Search (GWIS)
Consider the association of all pairs of genotypes to phenotypes

Large search space

5000 individuals, 500,000 SNPs (WTCCC)
Need to tabulate 125 billion contingency tables

Classification based analysis
Focus on SNPs in case control studies
New statistical tests
Consider specificity and sensitivity
Gain over univariate ROC
CPU (≈ days) and GPU (≈ hours)
Store the top 1 million pairs

Web service
http://gwis1.research.nicta.com.au/
Goudey,...,Ong,...,Kowalczyk, BMC Genomics, 2013
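The figure of 125 billion contingency tables follows from counting unordered pairs of 500,000 SNPs. A quick check of that arithmetic (variable names are illustrative):

```python
from math import comb

n_snps = 500_000
n_pairs = comb(n_snps, 2)  # unordered SNP pairs, one contingency table each

# Fraction of pairs retained when only the top 1 million are stored.
kept_fraction = 1_000_000 / n_pairs
```

The count comes to 124,999,750,000, i.e. roughly 125 billion, so storing the top 1 million pairs keeps well under one pair in 100,000.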

p-values

Interpreting p-values
Is a $10^{-10}$ probability of association very significant?

Quote
"... but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result."
Fisher, The Design of Experiments, 1947, p. 14

Stability of scoring
We consider p-values as a score of association.

How stable is this score if we repeat the experiment?
How do we combine scores?

Challenges

Scores available for only the top-k examples
Scores from different sources not calibrated

How to represent ranks?

Multiple ways to represent ranks

Ordered list of n objects selected from Ω

List of values [1, . . . , n] (the ranks of the object)

Normalised ranks ∈ (0, 1)

Permutation mapping R : Ω→ (0, 1)

Measuring Overlap

Motivation
Given a set of replicated experiments, how do we measure overlap?

Examples

Perform repeated splits of the data
Experiments on different cohorts
Multiple sources of information

Challenges

Scores available for only the top-k examples
Scores from different sources not calibrated

Signal and Noise

Set based overlap

Running example (6 objects)

A = [a, b, c, d, e, f ]

B = [a, b, e, f, c, d]

Jaccard Index

$$\text{overlap} = \frac{|A \cap B|}{|A \cup B|}$$

Measuring stability

Easy to compute
Works for top-k lists
Consider the top-3 lists from above:

$$\text{Jaccard index} = \frac{|\{a, b\}|}{|\{a, b, c, e\}|} = \frac{1}{2}$$

Ignores the order given by scores
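A minimal sketch of the set-based overlap on the running example; the function name is an illustrative choice.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two lists treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


A = ["a", "b", "c", "d", "e", "f"]
B = ["a", "b", "e", "f", "c", "d"]

top3 = jaccard(A[:3], B[:3])  # {a, b} over {a, b, c, e} -> 0.5
full = jaccard(A, B)          # same six objects in a different order -> 1.0
```

The full lists score 1.0 even though their order differs, which is exactly the limitation noted above.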

Spearman’s ρ

Similar to Pearson’s correlation for the measure of dependence

Spearman’s ρ is a correlation measure between ranked lists

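$$\rho(A, B) := \frac{\sum_i \left(r_A^{(i)} - \bar{r}_A\right)\left(r_B^{(i)} - \bar{r}_B\right)}{\sqrt{\sum_i \left(r_A^{(i)} - \bar{r}_A\right)^2 \sum_i \left(r_B^{(i)} - \bar{r}_B\right)^2}}$$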

Running example:

ρ([a, b, c, d, e, f ], [a, b, e, f, c, d]) = 0.543

(Jaccard index = 1)

Need the same elements in A and B

ρ([a, b, c], [a, b, e]) ?
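The running example can be reproduced by applying Pearson's correlation to the rank vectors. This is a plain-Python sketch with illustrative function names; it raises an error when the two lists do not contain the same elements, which is the situation the last line of the slide asks about.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


def ranks_of(ordered, objects):
    """Rank (1 = first) of every object according to the ordered list."""
    position = {obj: i + 1 for i, obj in enumerate(ordered)}
    return [position[obj] for obj in objects]


def spearman(a, b):
    """Spearman's rho between two orderings of the same set of objects."""
    if set(a) != set(b):
        raise ValueError("both lists must rank the same elements")
    objects = sorted(a)
    return pearson(ranks_of(a, objects), ranks_of(b, objects))


A = ["a", "b", "c", "d", "e", "f"]
B = ["a", "b", "e", "f", "c", "d"]
rho = spearman(A, B)  # 0.543 to three decimals, as on the slide
```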

Spearman’s ρ on top k lists

Simple idea
Define Spearman’s ρ for top k lists

Key observation
Any element of list A that does not appear in list B must have a rank in B that is higher than the number of elements in B

Running example (top-3)

A = [a, b, c, d, e, f] and B = [a, b, e, f, c, d]

$A_3 = [a, b, c]$ and $B_3 = [a, b, e]$

$A_3^{B_3\to} = [a, b, c, e]$ and $B_3^{A_3\to} = [a, b, e, c]$

Spearman’s ρ = $\rho(A_3^{B_3\to}, B_3^{A_3\to}) = 0.8$

Spearman’s ρ on top k lists

Extend the list
We expand lists A and B to complete rankings over the same set of elements, denoting them as $A^{B\to}$ and $B^{A\to}$ respectively.

The missing values in the extension are given the average rank.

Running example (top-4)

$A_4 = [a, b, c, d]$ and $B_4 = [a, b, e, f]$

$A_4^{B_4\to} = [1, 2, 3, 4, 5.5, 5.5]$ and $B_4^{A_4\to} = [1, 2, 5.5, 5.5, 3, 4]$

Makes no assumption about the order of the unranked objects

Other possible imputation approaches

Optimistic
Worst case

Bedo, Rawlinson, Goudey, Ong, PLoS ONE, 2014
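The extension with average-rank imputation can be sketched as follows. This is an illustrative reading of the two running examples, not the published implementation: each top-k list is extended over the union of both lists, and objects it does not contain share the average of the remaining ranks. Function names are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


def extend_ranks(top, other):
    """Ranks of `top` over the union with `other`, in sorted object order.

    Objects missing from `top` all receive the average of the unused ranks.
    """
    universe = sorted(set(top) | set(other))
    k, n = len(top), len(universe)
    ranks = {obj: i + 1 for i, obj in enumerate(top)}
    average_missing = (k + 1 + n) / 2  # mean of the ranks k+1, ..., n
    return [ranks.get(obj, average_missing) for obj in universe]


def topk_spearman(a, b, k):
    """Spearman's rho between the top-k prefixes of two ranked lists."""
    ak, bk = a[:k], b[:k]
    return pearson(extend_ranks(ak, bk), extend_ranks(bk, ak))


A = ["a", "b", "c", "d", "e", "f"]
B = ["a", "b", "e", "f", "c", "d"]

rho_top3 = topk_spearman(A, B, 3)   # 0.8, matching the top-3 example
a4_ranks = extend_ranks(A[:4], B[:4])  # [1, 2, 3, 4, 5.5, 5.5]
b4_ranks = extend_ranks(B[:4], A[:4])  # [1, 2, 5.5, 5.5, 3, 4]
rho_top4 = topk_spearman(A, B, 4)
```

The optimistic and worst case imputations mentioned above would replace `average_missing` with a different assignment of the unused ranks; they are not implemented here.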

Signal and Noise

Spearman’s ρ

Simulate two cohorts by splitting

[Figure: "Cross validation stability for raWTC-GSS". x-axis: top k GWIS pairs (10^1 to 10^5); y-axis: Spearman's ρ; curve: bivariate.]

Bedo, Rawlinson, Goudey, Ong, PLoS ONE, 2014

Measuring Overlap

Motivation
Given a set of replicated experiments, how do we measure overlap?

Challenges

Scores available for only the top-k examples
Scores from different sources not calibrated

Model

Ranked list: Instead of just using set intersection, we can use the scores from GWIS to order the results.
Top k: Traditional methods (Spearman’s ρ) require ranks for the whole list. We have incomplete information, but we know our ranks are the top ones.
Multivariate: Textbook Spearman’s ρ is for computing correlation between two ranks. We want to compute the correlation between multiple ranked lists.
Bedo, Ong, JMLR (to appear)

Multiple replicates

[Figure: "Cross validation stability for raWTC-GSS". x-axis: top k GWIS pairs (10^1 to 10^5); y-axis: Spearman's ρ; curves: optimistic, empirical, bivariate.]

*-Seq

dsRNA-Seq

FRAG-Seq

SHAPE-Seq

PARTE-Seq

PARS-Seq

DMS-Seq...

Nucleo-Seq

DNAse-Seq

Sono-Seq

ChIA-PET-Seq

FAIRE-Seq

NOMe-Seq

ATAC-Seq...

GRO-Seq

Quartz-Seq

CAGE-Seq

Nascent-Seq

Cel-Seq

3P-Seq...

https://liorpachter.wordpress.com/seq/

Integrating different sources of data

Variation: SNP, Structural, Methylation, Expression, …

[Figure: table with columns A to L and rows ID3023, ID4454, ID7675, ID2283]

Sequence Analysis

Association Study

Rank aggregation

Modeling using Spearman’s correlation

Stability of feature selection
How to measure overlap?

ρ(R1, . . . , Rd)

Rank aggregation
How to combine different sources of information?
Macintyre, Yepes, Ong, Verspoor, PeerJ, 2014

Optimal aggregator: geometric mean

How to combine different sources of information?
We maximise multivariate correlation

$$R^* = \arg\max_R \; \rho(R, R_1, R_2, \ldots, R_d).$$

Theorem: The aggregator that maximises multivariate Spearman’s correlation is the product of the normalised ranks.

Use the geometric mean

NOT pairwise correlation
Instead of decomposing the association into a combination of pairwise similarities ρ(R,R1), ρ(R,R2), . . . , ρ(R,Rd).

Learning weighting of experts
We can also do supervised learning to rank

Bedo, Ong, JMLR (to appear)
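A small sketch of aggregation by the geometric mean of normalised ranks. Several details are assumptions made for illustration: ranks are normalised as r / (n + 1) so they lie strictly inside (0, 1), rank 1 is treated as the best position, and ties in the aggregate are broken alphabetically. The three expert orderings are hypothetical.

```python
from math import prod


def normalised_ranks(ordering):
    """Map each object to rank / (n + 1), a value strictly between 0 and 1."""
    n = len(ordering)
    return {obj: (i + 1) / (n + 1) for i, obj in enumerate(ordering)}


def geometric_mean_aggregate(orderings):
    """Order objects by the geometric mean of their normalised ranks."""
    tables = [normalised_ranks(o) for o in orderings]
    d = len(tables)
    score = {obj: prod(t[obj] for t in tables) ** (1 / d) for obj in tables[0]}
    return sorted(score, key=lambda obj: (score[obj], obj))


# Three hypothetical sources ranking the same four objects.
experts = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
consensus = geometric_mean_aggregate(experts)
```

Because the geometric mean is a monotone function of the product, sorting by either gives the same ordering, which is why the theorem (product) and the recommendation (geometric mean) agree.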

What are good biomarkers?

Genome Wide Association Studies

Which mutations are associated with tall poppies?
Identify biomarkers with hypothesis tests

Finding stable biomarkers

Split cohort into two (cross validation)
Investigate rank correlation between scores

Integrating information via ranks

Multivariate Spearman correlation using copulas
Geometric mean is the optimal aggregator

What to measure?

fw(x) : X→ Y

Active Learning / Expt. Design

Use predictor to identify good candidates

Annotate top-k items
Confidence interval improves performance
Explore - exploit tradeoff

Krause, Ong, NIPS 2011

Finding black holes and redshifts

Machine learning to classify images
Show 10 candidates to expert daily

Collaboration with ANU, ANTF, CAASTRO

Glucose metabolism in Yeast

Multiple possible models
Design biological experiments that maximise information gain

Collaboration of ETHZ with SystemsX Switzerland

What is a model?

Bergman insulin dependent glucose metabolism model.

TOR pathway

Finding good models

Optimised experimental design

Measurements
Experiments produce readouts $y(t_i)$, grouped into datasets $Y_\pi$ for an experiment $\pi$.

Bayes rule
For a particular model $f$ (taking care of parameters),

$$p(f \mid Y_\pi) = \frac{p(Y_\pi \mid f)\, p(f)}{p(Y_\pi)}$$

Information gain
We want to take measurements that change model probabilities

$$D_{KL}\left[p(f \mid Y_\pi) \,\|\, p(f)\right] = \sum_{f \in F} p(f \mid Y_\pi) \log_2 \frac{p(f \mid Y_\pi)}{p(f)}$$

Marginalise over possible outcomes
Maximise expected information gain (tough computational problem)

$$\arg\max_\pi \; \mathbb{E}_{Y_\pi}\, D_{KL}\left[p(f \mid Y_\pi) \,\|\, p(f)\right]$$
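For a finite set of models and a finite set of possible readouts, the expected information gain can be computed exactly by enumeration. The sketch below makes that simplifying assumption (real readouts are time series, which is what makes the problem hard); the two models, the three experiments and their likelihood tables are entirely hypothetical.

```python
from math import log2


def kl_bits(post, prior):
    """D_KL[post || prior] in bits; terms with zero posterior mass contribute 0."""
    return sum(p * log2(p / prior[f]) for f, p in post.items() if p > 0)


def expected_information_gain(prior, likelihood):
    """E_Y D_KL[p(f|Y) || p(f)], where likelihood[f][y] = p(Y = y | f)."""
    outcomes = list(next(iter(likelihood.values())))
    total = 0.0
    for y in outcomes:
        evidence = sum(prior[f] * likelihood[f][y] for f in prior)  # p(Y = y)
        if evidence == 0:
            continue  # this outcome cannot occur under any model
        post = {f: prior[f] * likelihood[f][y] / evidence for f in prior}  # Bayes rule
        total += evidence * kl_bits(post, prior)
    return total


prior = {"f1": 0.5, "f2": 0.5}
experiments = {
    "perfect": {"f1": {"high": 1.0, "low": 0.0}, "f2": {"high": 0.0, "low": 1.0}},
    "noisy": {"f1": {"high": 0.9, "low": 0.1}, "f2": {"high": 0.1, "low": 0.9}},
    "uninformative": {"f1": {"high": 0.5, "low": 0.5}, "f2": {"high": 0.5, "low": 0.5}},
}

gains = {name: expected_information_gain(prior, lik) for name, lik in experiments.items()}
best = max(gains, key=gains.get)  # the argmax over experiments
```

An experiment on which the models make identical predictions has zero expected gain, while one that separates two equally likely models perfectly yields one bit.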

Experiments, experiments, ...

What is a biomarker?

How to measure?
Use adaptive experimental design to identify important time series.
Busetto et al. Near-optimal experimental design for model selection in systems biology, 2013

What to measure?
Combine various sources of information for robust decision making.
Macintyre et al. Associating disease-related genetic variants in intergenic regions to the genes they impact, 2014

Where to measure?
Use expert domain knowledge to construct dynamical models.
Brodersen et al. Generative embedding for model-based classification of fMRI data, 2011

A more philosophical section...

fw(x) : X→ Y

Label: Finding black holes

Physical models exist; we directly use images
There are relatively large amounts of data (examples)
Object localisation with crowd labels

Feature: Finding genetic associations

No mechanistic model of the phenomenon
High dimensional, low sample size
Stability of feature selection

Predictor: Finding good experiments

Partial mechanistic model of the phenomenon
Estimate the expected information gain

Discuss challenges to applying machine learning

Applications - Optimization - Models

[Figure: 2-mer frequency features]

Mathematical Model

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$$

Numerical Optimization

Prediction Task

Scoring candidates - ABCDE

Active Learning

Choose a particular example to label using heuristics
Annotator assumed to provide ground truth

Bandits

Select a choice from a set of actions
Simple algorithms with theoretical guarantees
Manage uncertainty with repeated sampling

Choice theory

Aggregate set of ranks into one ordering
Economics and social science, impossibility theorems

Designing Experiments

Choose a set of trials to measure
Optimisation algorithms with theoretical analysis
Information theory, real random variables

ML Open Source Software

Wider adoption of methods

Domain experts can use machine learning core
Available for teaching

Scientific reproducibility

Fair comparison of methods
Access to scientific tools

Community growth

“Given enough eyeballs, all bugs are shallow”
Combination of advances

mloss.org mldata.org

Plug and Pray

Machine Learning Open Source Software
Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?
jmlr.org/papers/v15/delgado14a.html

Spoiler: No

Usability and Reproducibility

(too much) focus on new algorithms
Documentation, modularity issues
Literate programming
yihui.name/knitr jupyter.org

Scientific computing workflows
galaxyproject.org

Dream: App Bazaar for data science

Bumpy road to data science

Two classes of objects

Data
images, counts, raw sensor data, output of simulation, results

Analysis
visualisation, user interface, predictors, observational statistics

Multi-sided platform

Decentralised architecture, not walled garden
Enable direct interaction between data owner and analytics system
Network effect: each new entrant benefits from whole network

Not just tech people
Domain experts, data managers, project management

Wish list

We need an open federated framework for scientific discovery

Provenance, trust and reliability

Management of legal rights

Uncertainty propagation

Confidentiality and privacy

Complex workflows

Late binding ontologies

Cross organisation, jurisdiction, technical boundaries

Decouple technique from problem

No proprietary control

*-as-a-service

One more challenge

McCulloch and Pitts, 1943

Multilayer perceptron

Deep neural networks

[Figure: convolutional network with layer sizes 32 × 28 × 28, 32 × 14 × 14, 32 × 10 × 10, 32 × 5 × 5, 800 and 32 × 32; operations: 4 × 4 convolution, 2 × 2 max pooling, 4 × 4 convolution, 2 × 2 max pooling.]

One more challenge

McCulloch and Pitts, 1943

Multilayer perceptron

Deep neural networks

Today’s ML systems

How to analyse two systems?

Conclusion

Prediction ≠ understanding
How can we use prediction to help with scientific research?

Three extensions

Not standard binary classification fw(x) : X → Y

What are good features? fw(x) : X → Y

What to measure? fw(x) : X → Y

Plug and pray

Software, software, software
Build the road and rail for data science
Understand combinations of machine learning components

Thank You

Prediction ≠ understanding
How can we use prediction to help with scientific research?

Three extensions

Not standard binary classification fw(x) : X → Y

What are good features? fw(x) : X → Y

What to measure? fw(x) : X → Y

Plug and pray

Software, software, software
Build the road and rail for data science
Understand combinations of machine learning components

Please make your research open

www.ong-home.my

Copulas

Intuition
For continuous random variables, copulas model the dependence component after discounting for univariate marginal effects

Probabilistic definition
Let U1, . . . , Ud be real random variables ∼ U([0, 1]).
A copula function C : [0, 1]^d → [0, 1] is a joint distribution

$$C_\theta(u_1, \ldots, u_d) = P(U_1 \le u_1, \ldots, U_d \le u_d)$$

The same Gaussian copula function

Copulas and Spearman’s ρ

Spearman’s ρ can be expressed in terms of the copula

$$\rho(A, B) = 12 \iint_{[0,1]^2} C(u, v)\, du\, dv - 3$$

Empirical copula

$$C_n(u, v) = \frac{1}{|\Omega|} \sum_{x \in \Omega} \mathbb{1}\left(R(x) \le u,\; S(x) \le v\right)$$

Why do the math?

Unclear how to extend formula for Spearman’s correlation.
Multivariate distributions ⇒ multivariate copula.

Multivariate Spearman’s ρ

A multivariate extension of Spearman’s ρ
For a d dimensional set of random variables u, the multivariate Spearman’s ρ is given by

$$\rho(R_1, \ldots, R_d) = Q(C, \pi) = h(d) \left( 2^d \int_{[0,1]^d} \pi(u)\, dC(u) - 1 \right),$$

where

$$h(d) = \frac{d + 1}{2^d - (d + 1)}.$$

Empirical multivariate Spearman’s correlation

$$\rho_n(R_1, \ldots, R_d) = h(d) \left( \frac{2^d}{n} \sum_{x} \prod_{j=1}^{d} R_j(x) - 1 \right).$$

No negative correlation
As the number of dimensions increases, the lower bound of Spearman’s ρ tends to zero
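The empirical formula is short enough to evaluate directly. The sketch below assumes that R_j(x) denotes a normalised rank in (0, 1) and uses (r − 0.5) / n as the normalisation; that choice, like the function names, is an assumption made for illustration and may differ from the published estimator.

```python
from math import prod


def h(d):
    """Scaling factor h(d) = (d + 1) / (2^d - (d + 1))."""
    return (d + 1) / (2 ** d - (d + 1))


def normalise(ranks):
    """Map integer ranks 1..n into (0, 1) via (r - 0.5) / n (an assumed convention)."""
    n = len(ranks)
    return [(r - 0.5) / n for r in ranks]


def multivariate_spearman(rank_lists):
    """Empirical multivariate Spearman's rho for d rank vectors over the same n objects."""
    d = len(rank_lists)
    n = len(rank_lists[0])
    u = [normalise(r) for r in rank_lists]
    mean_product = sum(prod(col[i] for col in u) for i in range(n)) / n
    return h(d) * (2 ** d * mean_product - 1)


# Ranks of objects a..f in the running example lists A and B.
ranks_a = [1, 2, 3, 4, 5, 6]
ranks_b = [1, 2, 5, 6, 3, 4]

rho_same = multivariate_spearman([ranks_a, ranks_a])
rho_ab = multivariate_spearman([ranks_a, ranks_b])
rho_three = multivariate_spearman([ranks_a, ranks_b, ranks_a])
```

With this normalisation the bivariate value for the running example is close to, but slightly below, the textbook 0.543, a finite-sample effect of the assumed convention. Since the product term is never negative, the formula cannot fall below −h(d), and h(d) shrinks quickly (h(2) = 3, h(3) = 1, h(4) = 5/11); this is a loose bound, but it is consistent with the remark that negative correlation disappears as d grows.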