COM 578 Empirical Methods in Machine Learning and Data Mining, Rich Caruana and Alex Niculescu, http://www.cs.cornell.edu/Courses/cs578/2002fa

Dec 19, 2015

Page 1: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

COM 578
Empirical Methods in Machine Learning and Data Mining

Rich Caruana
Alex Niculescu

http://www.cs.cornell.edu/Courses/cs578/2002fa

Page 2: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Today

Dull organizational stuff
  – Course Summary
  – Grading
  – Office hours
  – Homework
  – Final Project

Fun stuff
  – Historical Perspective on Statistics, Machine Learning, and Data Mining

Page 3: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Topics

Decision Trees
K-Nearest Neighbor
Artificial Neural Nets
Support Vectors
Association Rules
Clustering
Boosting/Bagging
Cross Validation
Data Visualization
Data Transformation
Feature Selection
Missing Values
Case Studies:
  – Medical prediction
  – Protein folding
  – Autonomous vehicle navigation

25-50% overlap with CS478

Page 4: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Grading

20% take-home mid-term
20% open-book final
30% homework assignments
30% course project (teams of 1-3 people)

Late penalty: one letter grade per day

Page 5: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Office Hours

Rich Caruana
Upson Hall 4157
Tue 4:30-5:00pm
Wed 1:30-2:30pm
[email protected]

Alex Niculescu
Rhodes Hall ???
[email protected]

Page 6: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Homeworks

Short programming assignments
  – e.g., implement backprop and test on a dataset
  – goal is to get familiar with a variety of methods
Two or more weeks to complete each assignment
C, C++, Java, Perl, shell scripts, or Matlab
Must be done individually
Hand in code with summary and analysis of results

Page 7: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Project

Mini Competition
Train best model on two different problems we give you
  – decision trees
  – k-nearest neighbor
  – artificial neural nets
  – bagging, boosting, model averaging, ...
Given train and test sets
  – Have target values on train set
  – No target values on test set
  – Send us predictions and we calculate performance
  – Performance on test sets is part of project grade
Due before exams: Friday, December 6
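The competition workflow above (fit on labeled training rows, then emit one prediction per unlabeled test row) can be sketched roughly as follows. This is only an illustration using k-nearest neighbor, one of the listed methods; the data, the function names `knn_predict` and `predict_test_set`, and the choice of Euclidean distance are all assumptions, not part of the course materials.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def predict_test_set(train_x, train_y, test_x, k=3):
    """Return one prediction per unlabeled test row, in the same order as test_x."""
    return [knn_predict(train_x, train_y, row, k) for row in test_x]


# Toy data (made up for illustration): two well-separated clusters.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [(0.05, 0.05), (1.05, 1.0)]  # no target values available for these rows

predictions = predict_test_set(train_x, train_y, test_x, k=3)
print(predictions)
```

Because the test targets are withheld, the only way to estimate performance before submitting is to hold out part of the training set.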

Page 8: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Text Books

Required Texts:
  – Machine Learning by Tom Mitchell
  – Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman
Optional Texts:
  – Pattern Classification, 2nd ed., by Richard Duda, Peter Hart, & David Stork
  – Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
Selected papers

Page 9: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Fun Stuff

Page 10: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics, Machine Learning, and Data Mining

Page 11: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Past, Present, and Future

Page 12: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Once upon a time...

Page 13: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics: 1850-1950

Hand-collected data sets
  – Physics, Astronomy, Agriculture, ...
  – Quality control in manufacturing
  – Many hours to collect/process each data point
Small: 1 to 100 data points
Low dimension: 1 to 10 variables
Exist only on paper (sometimes in text books)
Experts get to know data inside out
Data is clean: human has looked at each point

Page 14: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics: 1850-1950

Calculations done manually
  – manual decision making during analysis
  – human calculator pools for “larger” problems
Simplified models of data to ease computation
  – Gaussian, Poisson, ...
Get the most out of precious data
  – careful examination of assumptions
  – outliers examined individually

Page 15: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics: 1850-1950

Analysis of errors in measurements
What is most efficient estimator of some value?
How much error in that estimate?
Hypothesis testing:
  – is this mean larger than that mean?
  – are these two populations different?
Regression:
  – what is the value of y when x = x_i or x = x_j?
How often does some event occur?
  – p(fail(part_1)) = p_1; p(fail(part_2)) = p_2; p(crash(plane)) = ?
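As a rough illustration of two of the questions above (how much error is in an estimate, and is this mean larger than that mean), the sketch below computes a sample mean with its standard error and Welch's t statistic for two samples. The measurements are made up, the function names are hypothetical, and Welch's test is just one reasonable choice; the slides do not name a specific test.

```python
import math
import statistics


def mean_and_stderr(sample):
    """Return the sample mean and its estimated standard error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # statistics.stdev uses the n-1 (sample) denominator.
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, stderr


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means (unequal variances allowed)."""
    mean_a, se_a = mean_and_stderr(sample_a)
    mean_b, se_b = mean_and_stderr(sample_b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Made-up measurements, in the spirit of small hand-collected data sets.
a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.7, 4.9]
t = welch_t(a, b)
print(f"t = {t:.2f}")
```

A large |t| relative to the appropriate t distribution is evidence that the two means differ; looking up the p-value is omitted here.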

Page 16: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 17: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics would look very different if it had been born after the computer instead of 100 years before the computer

Page 18: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics meets Computers

Page 19: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Machine Learning: 1950-2000...

Medium size data sets become available
  – 100 to 100,000 records
  – High dimension: 5 to 250 dimensions (more if vision)
  – Fit in memory
Exist in computer, not usually on paper
Too large for humans to read and fully understand
Data not clean
  – Missing values, errors, outliers, ...
  – Many attribute types: boolean, continuous, nominal, discrete, ordinal

Page 20: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Machine Learning: 1950-2000...

Computers can do very complex calculations on medium size data sets
Models can be much more complex than before
Empirical evaluation methods instead of theory
  – don’t calculate expected error, measure it from sample
  – cross validation
Fewer statistical assumptions about data
Make machine learning as automatic as possible
OK to have multiple models (vote them)
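The two ideas above, measuring error from the sample via cross validation and voting multiple models, might look like the following minimal sketch. Everything here is an assumption made for illustration: the helper names, the toy one-dimensional data, and `fit_threshold`, a deliberately trivial learner that stands in for any real model.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(xs, ys, fit, k=5, seed=0):
    """Estimate the error rate: train on k-1 folds, measure on the held-out fold."""
    errors = 0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(model(xs[i]) != ys[i] for i in held_out)
    return errors / len(xs)


def vote(models):
    """Combine several models into one that returns the majority prediction."""
    def combined(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return combined


def fit_threshold(xs, ys):
    """Hypothetical toy learner: cut halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)


# Made-up one-dimensional data: class 0 near 0, class 1 near 10.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 14.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_val_error(xs, ys, fit_threshold, k=5))

# Three models trained on different subsets, then voted.
committee = vote([
    fit_threshold(xs[::2], ys[::2]),
    fit_threshold(xs[1::2], ys[1::2]),
    fit_threshold(xs, ys),
])
print(committee(1.5), committee(12.5))
```

Each point is held out exactly once, so the reported error is measured on data the model did not see during fitting.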

Page 21: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Machine Learning: 1950-2000...

New Problems:
  – Can’t understand many of the models
  – Less opportunity for human expertise in process
  – Good performance in lab doesn’t necessarily mean good performance in practice
  – Brittle systems, work well on typical cases but often break on rare cases
  – Can’t handle heterogeneous data sources

Page 22: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

ML: Pneumonia Risk Prediction

[Figure: Pneumonia Risk predicted from Pre-Hospital Attributes (Age, Gender, Blood Pressure, Chest X-Ray) and In-Hospital Attributes (Albumin, Blood pO2, White Count, RBC Count)]

Page 23: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

ML: Autonomous Vehicle Navigation

[Figure: Steering Direction]

Page 24: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 25: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Can’t yet buy cars that drive themselves, and no hospital uses artificial neural nets yet to make critical decisions about patients.

Page 26: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Machine Learning Leaves the Lab

Computers get Bigger/Faster

Page 27: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Data Mining: 1995-20??

Huge data sets collected fully automatically
  – large scale science: genomics, space probes, satellites

Page 28: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 29: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Protein Folding

Page 30: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 31: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 32: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Data Mining: 1995-20??

Huge data sets collected fully automatically
  – large scale science: genomics, space probes, satellites
  – consumer purchase data
  – web: > 100,000,000 pages of text
  – clickstream data (Yahoo!: gigabytes per hour)
  – many heterogeneous data sources
High dimensional data
  – “low” of 45 attributes in astronomy
  – 100’s to 1000’s of attributes common
  – Linkage makes many 1000’s of attributes possible

Page 33: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Data Mining: 1995-20??

Data exists only on disk (can’t fit in memory)
Experts can’t see even modest samples of data
Calculations done completely automatically
  – large computers
  – efficient (often simplified) algorithms
  – human intervention difficult
Models of data
  – complex models possible
  – but complex models may not be affordable (Google)
Get something useful out of massive, opaque data

Page 34: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Data Mining: 1990-20??

What customers will respond best to this coupon?
Who is it safe to give a loan to?
What products do consumers purchase in sets?
What is the best pricing strategy for products?
Are there unusual stars/galaxies in this data?
Do patients with gene X respond to treatment Y?
What job posting best matches this employee?
How do proteins fold?
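One of the questions above, which products consumers purchase in sets, can be approached in its simplest form by counting how often pairs of items appear in the same basket. The sketch below is a bare-bones version of that idea and not a full association-rule miner; the basket data and the `frequent_pairs` name are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count baskets containing each unordered item pair; keep pairs seen at least min_support times."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Made-up purchase data for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "chips"],
    ["bread", "butter"],
]
print(frequent_pairs(baskets, min_support=2))
```

Counting all pairs grows quadratically with basket size, which is one reason practical methods prune candidates before counting.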

Page 35: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Data Mining: 1995-20??

New Problems:
  – Data too big
  – Algorithms must be simplified and very efficient (linear in size of data if possible, one scan is best!)
  – Reams of output too large for humans to comprehend
  – Garbage in, garbage out
  – Heterogeneous data sources
  – Very messy uncleaned data
  – Ill-posed questions
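To make the "one scan is best" point concrete, here is a small sketch of a single-pass computation: Welford's algorithm for the running mean and sample variance, which reads each record once and keeps only three numbers in memory. The function name and the sample values are assumptions for illustration; the generator stands in for records streamed from disk.

```python
def one_pass_mean_variance(stream):
    """Welford's algorithm: mean and sample variance in a single scan with O(1) memory."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# A generator stands in for records read sequentially from disk.
records = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
print(one_pass_mean_variance(records))
```

The textbook two-pass formula would need to scan the data twice (once for the mean, once for the deviations), which is exactly what becomes expensive when the data does not fit in memory.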

Page 36: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 37: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Statistics, Machine Learning, and Data Mining

Historic revolution and refocusing of statistics
Statistics, Machine Learning, and Data Mining merging into a new multi-faceted field
Old lessons and methods still apply, but are used in new ways to do new things
Those who don’t learn the past will be forced to reinvent it

Page 38: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .

Change in Scientific Methodology

Traditional:
  – Formulate hypothesis
  – Design experiment
  – Collect data
  – Analyse results
  – Review hypothesis
  – Repeat/Publish

New:
  – Design large experiment
  – Collect large data
  – Put data in large database
  – Formulate hypothesis
  – Evaluate hypothesis on database
  – Run limited experiments to drive nail in coffin
  – Review hypothesis
  – Repeat/Publish

Page 39: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .
Page 40: COM 578 Empirical Methods in Machine Learning and Data Mining Rich Caruana Alex Niculescu .