Machine learning for bioinformatics A/Prof Nicola Armstrong Mathematics and Statistics Murdoch University

Machine learning for bioinformaticsbioinformatics.org.au/winterschool/wp-content/uploads/sites/15/201… · Machine learning for bioinformatics ... Machine learning(ML) is the scientific

Jun 03, 2020


Machine learning for bioinformatics
A/Prof Nicola Armstrong
Mathematics and Statistics
Murdoch University


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.

Wikipedia


In Bioinformatics…


Classification


Classification

Discrimination: Training set (data with known classes) → Classification technique → Classification rule


Classification

Discrimination: Training set (data with known classes) → Classification technique → Classification rule

Prediction: Data with unknown classes → Classification rule → Class assignment
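The two stages can be sketched with a very small classifier. This is only an illustration: the nearest-centroid rule, the function names and the toy data are assumptions chosen for brevity, not a method taken from the slides.

```python
from math import dist


def train_nearest_centroid(samples, labels):
    """Discrimination: learn one centroid per class from labelled data."""
    grouped = {}
    for x, y in zip(samples, labels):
        grouped.setdefault(y, []).append(x)
    # The "classification rule" here is simply the per-class mean vector.
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in grouped.items()
    }


def predict(centroids, sample):
    """Prediction: assign the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda y: dist(centroids[y], sample))


# Hypothetical training set: two features per sample, classes known.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["control", "control", "case", "case"]

rule = train_nearest_centroid(train_x, train_y)

# Data with unknown classes are passed through the learned rule.
print(predict(rule, [0.9, 1.1]))
print(predict(rule, [2.8, 3.0]))
```

Here the fitted centroids play the role of the classification rule: they are built once from the training set and then reused for every new sample.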


Classification Rule

• Classification technique
• Feature selection
• Parameters [pre-determined, estimable]
• Distance measure
• Aggregation methods

The classification rule is like a black box; some methods provide more insight into the contents of the box.


Classification Techniques

• Decision Tree based Methods
  – e.g. random forests (Breiman 2001)
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes (DLDA) and Bayesian Belief Networks
• Support Vector Machines


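Multiple Regression

Linear: $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$

Logistic: $p = \dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}$, where $p$ is the probability that $Y = 1$

• X is the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype:
  – Linear: continuous phenotypic measurement.
  – Logistic: 0 = no, 1 = yes.
• β are the regression coefficients.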
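As a small worked example of the logistic model, the sketch below turns a set of regression coefficients and one individual's genotype into a probability. The coefficient values and the genotype coding (0, 1 or 2 copies of an allele) are made up for illustration.

```python
from math import exp


def linear_predictor(beta0, betas, x):
    """beta0 + beta1*x1 + ... + betak*xk."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))


def logistic_probability(beta0, betas, x):
    """Probability that Y = 1 under the logistic model."""
    eta = linear_predictor(beta0, betas, x)
    return exp(eta) / (1.0 + exp(eta))


# Hypothetical coefficients for two SNPs and one individual's genotype.
beta0, betas = -1.0, [0.5, 1.5]
genotype = [2, 0]

print(linear_predictor(beta0, betas, genotype))      # -1 + 0.5*2 + 1.5*0 = 0.0
print(logistic_probability(beta0, betas, genotype))  # exp(0) / (1 + exp(0)) = 0.5
```

A linear predictor of zero corresponds to a probability of one half; positive values push the probability towards 1 and negative values towards 0.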


Ridge Regression

• X is the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• $\beta^{ridge}$ is the ridge regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of ridge penalty.
• $\lambda \sum_{j} \beta_j^2$ is the ridge penalty.


LASSO (the Least Absolute Shrinkage and Selection Operator)

• X is the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• $\beta^{lasso}$ is the lasso regression coefficient.
• λ ≥ 0 is the tuning parameter that controls the amount of lasso penalty.
• $\lambda \sum_{j} |\beta_j|$ is the lasso penalty.


Elastic Net

• X is the genomic information at each locus, e.g. SNPs or methylation levels.
• Y is the phenotype.
• β0, β are the elastic net regression coefficients.
• λ ≥ 0 is the tuning parameter that controls the amount of elastic net penalty.
• 0 ≤ α ≤ 1 is the elastic net penalty weight.
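One common way to write the three penalised least-squares criteria side by side is given below. The indices (n individuals, p loci) and the exact scaling of the penalty terms are assumptions; software packages differ in how they scale the residual sum of squares and the ridge term.

```latex
\begin{aligned}
\hat{\beta}^{ridge} &= \arg\min_{\beta_0,\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\hat{\beta}^{lasso} &= \arg\min_{\beta_0,\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2
  + \lambda \sum_{j=1}^{p} |\beta_j| \\
(\hat{\beta}_0, \hat{\beta})^{enet} &= \arg\min_{\beta_0,\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2
  + \lambda \Bigl[(1-\alpha)\sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j|\Bigr]
\end{aligned}
```

In this form α = 0 recovers the ridge criterion and α = 1 recovers the lasso criterion.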


Ridge vs. LASSO vs. Elastic Net

* Not well: when the number of SNPs >> the number of people, the maximum number of variables that LASSO can select before it saturates is equal to the number of people.

All regression methods rely on a linearity assumption.
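The different behaviour of the three penalties is easiest to see with a single predictor and no intercept, where each estimate has a closed form. The sketch below assumes the criterion sum((y - b*x)^2) + lam*((1 - alpha)*b^2 + alpha*|b|); this parametrisation, the function names and the data are illustrative assumptions.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_1d(x, y, lam, alpha):
    """Minimise sum((y - b*x)^2) + lam*((1-alpha)*b^2 + alpha*|b|) over b.

    alpha = 0 gives ridge, alpha = 1 gives lasso (assumed parametrisation).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam * alpha / 2.0) / (sxx + lam * (1.0 - alpha))


# Hypothetical centred data: the unpenalised least-squares slope is 2.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

print(elastic_net_1d(x, y, lam=0.0, alpha=0.0))  # no penalty
print(elastic_net_1d(x, y, lam=2.0, alpha=0.0))  # ridge: shrunk, never exactly 0
print(elastic_net_1d(x, y, lam=2.0, alpha=1.0))  # lasso: shrunk
print(elastic_net_1d(x, y, lam=8.0, alpha=1.0))  # lasso: set exactly to 0
print(elastic_net_1d(x, y, lam=2.0, alpha=0.5))  # elastic net: in between
```

Ridge rescales the coefficient, whereas the lasso subtracts a fixed amount and can therefore set it exactly to zero, which is what makes it a selection operator.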


Random Forest


Neural Networks


Ensemble packages in R

• Allow application and evaluation of multiple techniques on a dataset
• Simple and easy to use
• CMA
• caret
• ClassifyR


Classification

| Method Description | Function(s) | DM | DV | DD |
|---|---|---|---|---|
| Wrapper for sparsediscrim's diagonal LDA function dlda. | DLDAtrainInterface, DLDApredictInterface | ✓ | | |
| Wrapper for PoiClaClu's Poisson LDA function classify. | classifyInterface | ✓ | | |
| Wrapper for glmnet's elastic net GLM function glmnet. | elasticNetGLMinterface | ✓ | | |
| Wrapper for pamr's Nearest Shrunken Centroid functions pamr.train and pamr.predict. | NSCtrainInterface, NSCpredictInterface | ✓ | | |
| Wrapper for multinomial logistic regression as implemented in CRAN package 'mnlogit'. | logisticRegressionTrainInterface, logisticRegressionPredictInterface | ✓ | | |
| Fisher's Linear Discriminant Analysis | fisherDiscriminant | ✓ | ✓* | |
| Feature-wise mixtures of normals and voting | mixModelsTrain, mixModelsPredict | ✓ | ✓ | ✓ |
| Feature-wise kernel density estimation and voting | naiveBayesKernel | ✓ | ✓ | ✓ |
| Wrapper for randomForest's function randomForest. | randomForestInterface | ✓ | ✓ | ✓ |
| Wrapper for e1071's Support Vector Machine function svm. | SVMinterface | ✓ | ✓† | ✓† |

* If ordinary numeric measurements have been transformed to absolute deviations by subtractFromLocation.

† If kernel is not "linear".


MODEL PERFORMANCE


Validation: Performance assessment

• Can be based on:
  – Cross-validation
  – Test set
  – Independent testing on future dataset.
  – Independent testing on existing dataset (integrative analysis).


Cross-validation

For i = 1, …, 100:
  Partition data into n disjoint sets S1, S2, …, Sn
  For k = 1, …, n:
    Omit Sk
    Using all data except Sk, build classifier
    Use classifier to predict classes for Sk
Summary statistics of performance
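The loop structure above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the one-dimensional nearest-mean classifier, the function names, the choice of 5 folds and the toy data.

```python
import random


def k_fold_indices(n_samples, n_folds, rng):
    """Randomly partition sample indices into n_folds disjoint sets."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]


def fit_nearest_mean(xs, ys):
    """Toy classifier: store the mean of the feature for each class."""
    means = {}
    for label in set(ys):
        values = [v for v, lab in zip(xs, ys) if lab == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, v):
    """Assign the class whose stored mean is closest to v."""
    return min(means, key=lambda label: abs(means[label] - v))


def repeated_cross_validation(x, y, n_folds=5, n_repeats=100, seed=0):
    """Return one accuracy per repeat of n_folds-fold cross-validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):                                   # For i = 1, ..., 100
        correct = 0
        for held_out in k_fold_indices(len(x), n_folds, rng):    # For k = 1, ..., n
            held = set(held_out)
            train = [j for j in range(len(x)) if j not in held]  # omit S_k
            model = fit_nearest_mean([x[j] for j in train], [y[j] for j in train])
            correct += sum(predict_nearest_mean(model, x[j]) == y[j] for j in held_out)
        accuracies.append(correct / len(x))
    return accuracies


# Hypothetical, well-separated data with one feature per sample.
x = [0.1, 0.2, 0.3, 0.4, 0.5, 2.1, 2.2, 2.3, 2.4, 2.5]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracies = repeated_cross_validation(x, y)
# Summary statistics of performance over the repeats.
print(min(accuracies), sum(accuracies) / len(accuracies), max(accuracies))
```

The classifier is rebuilt from scratch inside the inner loop, so the held-out set never influences the rule that is used to predict it.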


Metrics for Performance Evaluation

• Focus on the predictive capability of a model
  – Rather than how long it takes to classify or build models, scalability, etc.
• Confusion Matrix:

| | PREDICTED Class=Yes | PREDICTED Class=No |
|---|---|---|
| ACTUAL Class=Yes | a | b |
| ACTUAL Class=No | c | d |

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)
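A minimal sketch of how the four cells can be counted from paired actual and predicted labels. The function name and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for a two-class problem."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1  # true positive
        elif act == positive:
            b += 1  # false negative
        elif pred == positive:
            c += 1  # false positive
        else:
            d += 1  # true negative
    return a, b, c, d


# Hypothetical labels for eight observations.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"]

print(confusion_counts(actual, predicted))  # (2, 1, 1, 4)
```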


Accuracy

$\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$

• Issue if data highly skewed/biased
  – 0.5% of data is in class 1 and rest is in class 0. Model has 99.5% accuracy! But, your model could just be: classify each observation to be in category 0.
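The skewed-data problem can be checked with a few lines of arithmetic. The sample size of 1000 is an assumption; only the 0.5% proportion comes from the example above.

```python
def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assume 1000 observations, 0.5% (5) in class 1, and a "model" that
# classifies every observation as class 0.
n_total, n_class1 = 1000, 5
tp, fn = 0, n_class1            # every class-1 observation is missed
fp, tn = 0, n_total - n_class1  # every class-0 observation is correct

print(accuracy(tp, fn, fp, tn))  # 0.995
print(tp / (tp + fn))            # sensitivity is 0.0
```

The accuracy looks excellent even though not a single class-1 observation is detected, which is why accuracy alone is a poor summary for unbalanced classes.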


Other metrics

• Misclassification rate: 1 - Accuracy
• Sensitivity / Recall / true positive rate: TP/(TP+FN)
• Specificity / true negative rate: TN/(TN+FP)
• Positive predictive value / Precision: TP/(TP+FP)
• Negative predictive value: TN/(TN+FN)
• F-score: harmonic mean of precision & recall, 2*(precision*recall)/(precision+recall)
  – 1 = good, 0 = bad, doesn't consider TNs
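These definitions translate directly into code. The sketch below uses hypothetical counts and does not guard against empty denominators (for example, no predicted positives), which a real implementation would need to handle.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the metrics listed above from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "misclassification": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f_score": f_score,
    }


# Hypothetical counts: 40 TP, 10 FN, 20 FP, 30 TN.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the F-score only involves TP, FP and FN, so changing the number of true negatives leaves it unchanged.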


ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals
  – Characterize the trade-off between positive hits and false alarms
• ROC curve plots sensitivity (on the y-axis) against (1-specificity) (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
  – changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point


ROC Curve

(Sensitivity, 1-specificity):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
  – Random guessing
  – Below diagonal line:
    • prediction is opposite of the true class
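A ROC curve can be traced by sorting the observations by classifier score and lowering the threshold one observation at a time. In the sketch below the points are stored in plotting order, (x, y) = (1 - specificity, sensitivity). The scores and labels are made up, tied scores are assumed not to occur, and the area under the curve is an extra summary computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) for every threshold.

    labels are 1 (positive) or 0 (negative); higher scores mean "more positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: all declared negative
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points          # ends at (1.0, 1.0): all declared positive


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical classifier scores and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))  # 8/9: 8 of the 9 positive/negative pairs are ranked correctly
```

An area of 0.5 corresponds to the diagonal (random guessing) and an area of 1 to a curve passing through the ideal corner.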


Model Selection

• In practice, often not much difference in performance between several approaches.
• Aim to choose the model which is:
  – Interpretable - can we see or understand why the model is making the decisions it makes?
  – Simple - easy to explain and understand
  – Accurate
  – Fast (to train and test)
  – Scalable (can be applied to a large dataset)


EXAMPLES


Hidden Markov Models

[Figure: trellis of hidden states 1, 2, …, K (πi) at each position, with observations x1, x2, x3, …, xK below]


Gene finding

input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACG
most probable path: [figure]
gene prediction: exon 1, exon 2, exon 3
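The most probable path through a hidden Markov model is usually found with the Viterbi algorithm. The sketch below uses a deliberately tiny two-state model (E for exon-like, I for intron-like); the states, all probabilities and the short sequence are invented for illustration and are far simpler than a real gene finder. All probabilities are assumed to be non-zero so that logarithms can be taken.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (Viterbi, in log space)."""
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r][0] + log(trans_p[r][s]))
            score = best[-1][prev][0] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            column[s] = (score, prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: "E" states prefer G/C, "I" states prefer A/T.
states = ["E", "I"]
start_p = {"E": 0.5, "I": 0.5}
trans_p = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
emit_p = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

path = viterbi("GGCCATAT", states, start_p, trans_p, emit_p)
print("".join(path))  # EEEEIIII
```

Working in log space avoids numerical underflow, which matters once sequences are thousands of bases long.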


Crossovers in meiosis

ChromHMM: annotating genomic regions

Ernst & Kellis 2002


MammaPrint

van 't Veer et al., Nature 2002; van de Vijver et al., NEJM 2002

Based on correlation
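A generic correlation-based rule can be sketched as follows: correlate a sample's expression profile with a reference profile and compare the result with a cut-off. The reference profile, the threshold, the class names and the data below are all hypothetical and are not the published signature.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def classify_by_correlation(profile, reference, threshold):
    """Label a sample by how strongly it correlates with the reference profile."""
    if pearson(profile, reference) > threshold:
        return "similar to reference"
    return "not similar to reference"


# Hypothetical reference profile over four genes and two new samples.
reference = [1.0, -0.5, 0.8, -1.2]
sample_a = [2.0, -1.0, 1.6, -2.4]   # same pattern, larger scale
sample_b = [-1.0, 0.5, -0.8, 1.2]   # opposite pattern

print(pearson(sample_a, reference), classify_by_correlation(sample_a, reference, 0.4))
print(pearson(sample_b, reference), classify_by_correlation(sample_b, reference, 0.4))
```

Because correlation ignores overall scale, sample_a is judged similar even though its values are twice as large as the reference.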