CS760 – Machine Learning
• Course Instructor: David Page
  • email: [email protected]
  • office: MSC 6743 (University & Charter)
  • hours: 1pm Tuesdays and Fridays
• We’ll meet 30 times this term (may or may not include exam in this count)
• We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term)
• Default: we will NOT meet on Fridays unless I announce it (at least one week’s notice)
• to understand what a learning system should do
• to understand how (and how well) existing systems work
  • Issues in algorithm design
  • Choosing algorithms for applications
• Some written and programming HWs
  • "hands on" experience valuable
  • HW0 – build a dataset
  • HW1 – experimental methodology
  • I’m updating the website as we go, so please wait for me to assign HWs in class
• "Midterm" exam (in class, about 90% through semester)
• Find project of your choosing
  • during last 4-5 weeks of class
Academic Misconduct (also on course homepage)
All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss with your peers, the TAs or the instructor ideas, approaches and techniques broadly, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.
A Few Examples of Machine Learning
• Movie recommender (Netflix prize… ensembles)
• Your spam filter (probably naïve Bayes)
• Google, Microsoft and Yahoo
• Predictive models for medicine (e.g., see news on Health Discovery Corporation and SVMs)
• Wall Street (e.g., Rebellion Research)
• Speech recognition (hidden Markov models) and natural language translation
• Identifying the proteins of an organism from its genome (also using HMMs… see CS/BMI 576)
• Many examples in scientific data analysis…
Some Quotes (taken from P. Domingos’ ML class notes at U-Washington)
• "A breakthrough in machine learning would be worth ten Microsofts" – Bill Gates, Chairman, Microsoft
• "Machine learning is the next Internet" – Tony Tether, previous Director, DARPA
• "Machine learning is the hot new thing" – John Hennessy, President, Stanford
• "Web rankings today are mostly a matter of machine learning" – Prabhakar Raghavan, Director of Research, Yahoo
• "Machine learning is going to result in a real revolution" – Greg Papadopoulos, CTO, Sun
• "Machine learning is today’s discontinuity" – Jerry Yang, founder and former CEO, Yahoo
• Employed by first machine learning systems, in 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Naughts and Crosses Engine
• Prior to these, some people believed computers could not improve at a task with experience
• Memorize I/O pairs and perform exact matching with new inputs
• If computer has not seen precise case before, it cannot apply its experience
• Want computer to “generalize” from prior experience
Some Settings in Which Learning May Help
• Given an input, what is the appropriate response (output/action)?
  • Game playing – board state/move
  • Autonomous robots (e.g., driving a vehicle) – world state/action
  • Video game characters – state/action
  • Medical decision support – symptoms/…
Broad Paradigms of Machine Learning
• Inducing Functions from I/O Pairs
  • Decision trees (e.g., Quinlan’s C4.5 [1993])
  • Connectionism / neural networks (e.g., backprop)
  • Nearest-neighbor methods
  • Genetic algorithms
  • SVMs
• Learning without Feedback/Teacher
  • Conceptual clustering
  • Self-organizing systems
  • Discovery systems
• We are assuming examples are IID: independent and identically distributed
• E.g., we are ignoring temporal dependencies (covered in time-series learning)
• E.g., we assume the learner has no say in which examples it gets (covered in active learning)
Empirical Learning: Task Definition
• Given
  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)
• Produce
  • A description that covers (includes) all/most of the positive examples and none/few of the negative examples (and, hopefully, properly categorizes most future examples!)
Note: one can easily extend this definition to handle more than two classes
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.
[Figure: a feature space with axes Size, Color, and Weight; a query point "?" is plotted at Size = Big, Weight = 2500, Color = Gray]
A “concept” is then a (possibly disjoint) volume in this space.
• More formally, a “concept” is of the form
  • ∀x, y, z: F(x, y, z) → Member(x, Class1)
Aspects of an ML System
• “Language” for representing classified examples
• “Language” for representing “concepts”
• Technique for producing concept “consistent” with the training examples
• Technique for classifying new instances
Each of these limits the expressiveness/efficiency of the supervised learning algorithm.
Collect K nearest neighbors, select majority classification (or somehow combine their classes)
• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (later) to select K
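The majority-vote rule above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a tiny hypothetical dataset; it is not code from the course:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical data: two well-separated clusters in 2-D.
data = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
        ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
print(knn_classify(data, (5.5, 5.5), k=3))  # -> pos
```

Changing `k` here is exactly the choice the slide raises; a tuning set would compare accuracies across several values of `k`.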
HW0 – Create Your Own Dataset (repeated from lecture #1)
• Think about before next class
• Read HW0 (on-line)
• Google to find:
  • UCI archive (or UCI KDD archive)
  • UCI ML archive (UCI ML repository)
  • More links in HW0’s web page
Standard Feature Types
for representing training examples – a source of “domain knowledge”
• Nominal
  • No relationship among possible values
    e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz)
• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size ∈ {small, medium, large} ← discrete
• Discrete
  • tokens (char strings, w/o quote marks and spaces)
• Continuous
  • numbers (ints or floats)
  • If only a few possible values (e.g., 0 & 1), use discrete
• i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …)
• We will ignore hierarchical info and only use the leaf values (common approach)
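The distinction above matters when encoding features numerically. A minimal sketch: ordered values can be mapped onto 1, 2, … as the slide suggests, while nominal values (which have no order) are commonly one-hot encoded instead. The one-hot step is a standard practice, not something the slide prescribes:

```python
def encode_ordered(value, scale=("small", "medium", "large")):
    """Map a totally ordered discrete feature onto 1, 2, ... (the slide's suggestion)."""
    return scale.index(value) + 1

def encode_nominal(value, domain=("red", "blue", "green")):
    """One-hot encode a nominal feature: no order among values, so each
    value gets its own 0/1 indicator rather than an arbitrary integer."""
    return [1 if value == v else 0 for v in domain]

print(encode_ordered("medium"))  # -> 2
print(encode_nominal("blue"))    # -> [0, 1, 0]
```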
HW0: Creating Your Dataset
Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for the target function (category)
[Diagram: an entity-relationship sketch of IMDB – Studio Made Movie, Director/Producer Directed Movie, Actor Acted in Movie; Studio carries Name, Country, List of movies; the person entities carry Name, Year of birth, Gender, Oscar nominations, List of movies; Movie carries Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season]
HW0: Representing as a Fixed-Length Feature Vector
<discuss on chalkboard>
Note: some advanced ML approaches do not require such “feature mashing” (e.g., ILP)
David Jensen’s group at UMass uses Naïve Bayes and other ML algorithms on the IMDB
From Earlier: Memorization
• Employed by first machine learning systems, in 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Naughts and Crosses Engine
• Prior to these, some people believed computers could not improve at a task with experience
• Memorize I/O pairs and perform exact matching with new inputs
• If computer has not seen precise case before, it cannot apply its experience
• Want computer to “generalize” from prior experience
Collect K nearest neighbors, select majority classification (or somehow combine their classes)
• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (later) to select K
• K-Nearest Neighbors / Instance-Based Learning (k-NN/IBL)
  • Distance functions
  • Kernel functions
  • Feature selection (applies to all ML algorithms)
• Classification
  • Learning a discrete-valued function
• Regression
  • Learning a real-valued function
IBL easily extended to regression tasks (and to multi-category classification)
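The extension to regression mentioned above is as simple as it sounds: instead of a majority vote over neighbor labels, average the neighbors' real-valued targets. A minimal sketch with hypothetical data, assuming Euclidean distance:

```python
import math

def knn_regress(train, query, k=3):
    """k-NN regression: predict the mean target value of the k nearest examples."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)

# Hypothetical 1-D data: target roughly x + 1 near the origin, plus one far-away point.
data = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 20.0)]
print(knn_regress(data, (1.5,), k=3))  # mean of targets 1.0, 2.0, 3.0 -> 2.0
```

Multi-category classification needs no change at all to the majority-vote version: the vote simply ranges over more than two labels.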
• IB2 – keep next instance if incorrectly classified by using previous instances
  • Uses less storage (good)
  • Order dependent (bad)
  • Sensitive to noisy data (bad)
(From Aha, Kibler and Albert in ML Journal)
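The IB2 rule above (store an instance only when the instances stored so far misclassify it) can be sketched as follows. This is a minimal illustration assuming 1-NN with Euclidean distance, not Aha et al.'s exact formulation:

```python
import math

def ib2(stream):
    """IB2 condensing: keep a training instance only if the instances
    kept so far would misclassify it (here: 1-nearest-neighbor)."""
    kept = []
    for x, label in stream:
        if kept:
            nearest = min(kept, key=lambda ex: math.dist(ex[0], x))
            if nearest[1] == label:
                continue  # already classified correctly -> don't store it
        kept.append((x, label))
    return kept

# Hypothetical stream: the second example of each class is redundant and is dropped.
stream = [((0, 0), "neg"), ((0, 1), "neg"), ((5, 5), "pos"), ((5, 6), "pos")]
print([lbl for _, lbl in ib2(stream)])  # -> ['neg', 'pos']
```

Reordering `stream` can change which instances are kept, and a mislabeled instance is always stored, which illustrates the "order dependent" and "sensitive to noisy data" drawbacks listed above.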
Variations on a Theme (cont.)
• IB3 – extend IB2 to more intelligently decide which examples to keep (see article)
  • Better handling of noisy data
• Another Idea – cluster groups, keep example from each (median/centroid)
  • Less storage, faster lookup
• Collect k nearest neighbors
• Give them to some supervised ML algo
• Apply learned model to test example
• Learning “Juntas” (Blum, Langley ’94)
  • Target concept is a function of a small subset of the features – the relevant features
  • Most features are irrelevant (not correlated with relevant features)
• In this case, nearness for kNN is based mostly on irrelevant features
• ML method we will discuss next time is Decision Tree learning
• Tree learners focus on choosing the most relevant features, so address Junta-learning better
• They choose features one at a time, in a greedy fashion
• Later we will cover support vector machines (SVMs)
• Like kNN, SVMs classify a new instance based on its similarity to other instances, using kernels to capture similarity
• But SVMs also assign intrinsic weights to examples (apart from distance)… “support vectors” have weight > 0
Forward vs. Backward Feature Selection
Forward:
• Faster in early steps because fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)
Backward:
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area important, features = length, width
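The forward direction above can be sketched as a greedy loop: repeatedly add whichever unused feature most improves tuning-set accuracy, and stop when nothing helps. This is a minimal illustration wrapped around k-NN, with a hypothetical toy dataset in which feature 0 predicts the class and feature 1 is constant noise; it is not any specific published algorithm:

```python
import math
from collections import Counter

def knn_accuracy(train, tune, features, k=3):
    """Tuning-set accuracy of k-NN restricted to the given feature indices."""
    def proj(x):
        return [x[i] for i in features]
    correct = 0
    for x, label in tune:
        neighbors = sorted(train, key=lambda ex: math.dist(proj(ex[0]), proj(x)))[:k]
        votes = Counter(l for _, l in neighbors)
        if votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(tune)

def forward_select(train, tune, n_features, k=3):
    """Greedily add the single feature that most improves tuning-set accuracy."""
    chosen, best_acc = [], 0.0
    improved = True
    while improved:
        improved, best_f = False, None
        for f in range(n_features):
            if f in chosen:
                continue
            acc = knn_accuracy(train, tune, chosen + [f], k)
            if acc > best_acc:
                best_acc, best_f, improved = acc, f, True
        if improved:
            chosen.append(best_f)
    return chosen

# Feature 0 separates the classes; feature 1 is the same everywhere.
train = [((0, 7), "a"), ((1, 7), "a"), ((9, 7), "b"), ((10, 7), "b")]
tune = [((0.5, 7), "a"), ((9.5, 7), "b")]
print(forward_select(train, tune, n_features=2))  # -> [0]
```

Backward selection is the mirror image: start from all features and greedily remove the one whose removal hurts accuracy least, which is why it can preserve synergistic pairs like length and width.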
Questions about IBL (Breiman et al. – CART book)
• Computationally expensive to save all examples; slow classification of new examples
  • Addressed by IB2/IB3 of Aha et al. and work of A. Moore (CMU; now Google)
  • Is this really a problem?
Questions about IBL (Breiman et al. – CART book)
• Intolerant of noise
  • Addressed by IB3 of Aha et al.
  • Addressed by k-NN version
  • Addressed by feature selection – can discard the noisy feature
• Intolerant of irrelevant features
  • Since algorithm very fast, can experimentally choose good feature sets (Kohavi, Ph.D. – now at Amazon)
• High sensitivity to choice of similarity (distance) function
  • Euclidean distance might not be best choice
• Handling non-numeric features and missing feature values is not natural, but doable
• No insight into task (learned concept not interpretable)
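The sensitivity to the distance function noted above is easy to see when features live on very different scales. The numbers below echo the earlier Size/Weight illustration but are purely hypothetical:

```python
import math

# Hypothetical examples with features on very different scales:
# weight in grams (range ~2500-2510) and size in meters (range ~0.1-2.0).
a = (2500.0, 0.1)       # same size as the query, slightly lighter
b = (2504.0, 2.0)       # similar weight, but a very different size
query = (2505.0, 0.1)

# Raw Euclidean distance is dominated by the large-scale weight feature,
# so b looks "nearest" even though its size differs wildly.
print(math.dist(query, a), math.dist(query, b))  # 5.0 vs ~2.15

# Rescaling each feature to [0, 1] over its observed range flips the answer.
def rescale(p, lo=(2500.0, 0.1), hi=(2510.0, 2.0)):
    return tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))

print(math.dist(rescale(query), rescale(a)),
      math.dist(rescale(query), rescale(b)))  # 0.5 vs ~1.0
```

Which answer is "right" depends on the task, which is exactly why the choice of distance function (and of feature scaling) matters so much for IBL.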