Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
Decision Trees (Part 1)
Marina Santini
[email protected]
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
Outline
• Greediness
• Divide and Conquer
• Inductive Bias of the Decision Tree
• Loss function
• Expected loss
• Empirical error
• Induction
Lecture 3: Decision Trees (1) 2
Learning: Generalization Ability
• Predicting the future based on the past
Predict whether a student will like a course
Training Data
That is, ....
• Questions = Features
• Answers = Feature Values
• Ratings = Class Labels
• An example is a set of feature values.
• Training data is a set of examples, each associated with a class label.
”Greedy model”: the most useful feature
– Histograms
– Root node
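The greedy step can be sketched in Python. In this toy example (the feature names, ratings, and data are invented for illustration) we build a histogram of class labels for each feature's YES and NO answers; the feature whose histograms are purest is the most useful one and becomes the root node:

```python
from collections import Counter

# Toy training data: answers to questions (features) plus a rating (label).
data = [
    ({"easy": True, "ai": True}, "like"),
    ({"easy": False, "ai": True}, "like"),
    ({"easy": True, "ai": False}, "nah"),
    ({"easy": False, "ai": False}, "nah"),
]

# For each feature, histogram the class labels on the YES and NO sides.
for feature in ["easy", "ai"]:
    yes = Counter(y for x, y in data if x[feature])
    no = Counter(y for x, y in data if not x[feature])
    print(feature, dict(yes), dict(no))
```

Here "ai" separates the labels perfectly (YES → all "like", NO → all "nah"), so a greedy learner would place it at the root.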
Divide & Conquer
• Divide:
– Partition the data into 2 parts:
• YES part vs NO part
• Conquer:
– Recurse and run the Divide routine
The end of the cycle
• ... When it becomes useless to query on additional features
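The whole Divide & Conquer loop, including the stopping test, might look like this sketch. The scoring rule (majority-label counts per split) and the toy data are illustrative assumptions, not the only possible choices:

```python
from collections import Counter

def majority_label(examples):
    """The most common class label among (features, label) pairs."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def score(examples, feature):
    """How many examples a split on `feature` would get right if each
    side (YES / NO) predicted its own majority label."""
    correct = 0
    for answer in (True, False):
        labels = [y for x, y in examples if x[feature] == answer]
        if labels:
            correct += Counter(labels).most_common(1)[0][1]
    return correct

def train(examples, features):
    # Stop when querying more features is useless: all labels already
    # agree, or there are no features left to ask about.
    if len({y for _, y in examples}) == 1 or not features:
        return majority_label(examples)                  # leaf node
    # Divide: split on the single most useful feature (greedy choice).
    best = max(features, key=lambda f: score(examples, f))
    yes = [(x, y) for x, y in examples if x[best]]
    no = [(x, y) for x, y in examples if not x[best]]
    if not yes or not no:        # the split failed to divide the data
        return majority_label(examples)
    rest = [f for f in features if f != best]
    # Conquer: recurse on the YES part and the NO part.
    return (best, train(yes, rest), train(no, rest))

data = [
    ({"easy": True, "ai": True}, "like"),
    ({"easy": False, "ai": True}, "like"),
    ({"easy": True, "ai": False}, "nah"),
    ({"easy": False, "ai": False}, "nah"),
]
tree = train(data, ["easy", "ai"])
print(tree)
```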
Decision tree: Inductive Bias
• The goal of the decision tree learning model is:
– to figure out what questions to ask,
– in what order, and
– what answer to predict once you have asked enough questions.
• The inductive bias of decision trees: the things that we want to learn to predict are more like the root node and less like the other branch nodes.
Informal Definition
• A decision tree is:
– a flow-chart-like structure, where
• each internal (non-leaf) node denotes a test on an attribute,
• each branch represents the outcome of a test, and
• each leaf (or terminal) node holds a class label.
• The topmost node in a tree is the root node.
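A minimal sketch of this informal definition, with the tree encoded as nested tuples (an illustrative representation chosen here, not a standard one): internal nodes test an attribute, branches are test outcomes, and leaves hold class labels.

```python
# An internal node is (feature, yes_subtree, no_subtree); a leaf is
# just a class label. "easy?" is the topmost (root) node.
tree = ("easy?", ("ai?", "like", "nah"), "nah")

def classify(tree, example):
    """Follow branch outcomes from the root node down to a leaf."""
    while isinstance(tree, tuple):      # internal node: test an attribute
        feature, yes_branch, no_branch = tree
        tree = yes_branch if example[feature] else no_branch
    return tree                         # leaf node: a class label

print(classify(tree, {"easy?": True, "ai?": False}))  # YES, then NO
```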
Formalising the learning problem: 1) the loss function
• The loss function ℓ(y, f(x)) measures how bad the prediction f(x) is when the true answer is y.
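Two standard concrete choices of loss function, sketched here for illustration: 0/1 loss for classification and squared loss for regression.

```python
def zero_one_loss(y, y_hat):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0 if y == y_hat else 1

def squared_loss(y, y_hat):
    """Squared loss for regression: penalizes large errors heavily."""
    return (y - y_hat) ** 2

print(zero_one_loss("like", "nah"))   # wrong class -> 1
print(squared_loss(3.0, 2.5))         # (3.0 - 2.5)^2 -> 0.25
```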
Formalising the learning problem: 2) the data generating distribution
• D(x, y): the probability of drawing the pair (x, y) from the (unknown) data generating distribution D.
Expected Loss
1. The loss function
2. The data generating distribution
Formulae: Expected Value

ε ≜ E_{(x,y)∼D} [ℓ(y, f(x))] = Σ_{(x,y)} D(x, y) · ℓ(y, f(x))

How to read:
• ε = epsilon (the expected loss)
• ≜ = equal by definition to (or: is defined as)
• E = blackboard-bold E (the expectation)
• subscript: the pair (x, y), drawn over script D
• ℓ(y, f(x)) = l of the pair y, f of x
In words: sum, over all the pairs (x, y) in script D, of D(x, y) times ℓ(y, f(x)).
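The sum can be checked on a toy distribution that, unlike a real D, is fully known (the probabilities, classifier, and labels below are invented purely for illustration):

```python
# A tiny, fully known data generating distribution D over (x, y) pairs.
# In practice D is unknown; this toy version just illustrates the formula.
D = {
    (0, "like"): 0.4,
    (0, "nah"): 0.1,
    (1, "like"): 0.1,
    (1, "nah"): 0.4,
}

def f(x):                        # a classifier to evaluate
    return "like" if x == 0 else "nah"

def loss(y, y_hat):              # 0/1 loss
    return 0 if y == y_hat else 1

# Expected loss: sum over all pairs (x, y) of D(x, y) * loss(y, f(x)).
expected = sum(p * loss(y, f(x)) for (x, y), p in D.items())
print(expected)
```

Here f is wrong only on the two low-probability pairs, so the expected loss is 0.1 + 0.1 = 0.2.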
Training Error
• The training error is the average error over the training data:

ε̂ ≜ (1/N) Σ_{n=1..N} ℓ(y_n, f(x_n))

• How to read: the training error epsilon-hat (ε̂) is equal by definition to 1 over N times the sum from n = 1 to capital N of ℓ of y_n and f of x_n.
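The definition translates directly into code; the sample and the predictor below are invented toy examples:

```python
def training_error(examples, f, loss):
    """Average loss of predictor f over the N training examples:
    (1/N) * sum over n of loss(y_n, f(x_n))."""
    return sum(loss(y, f(x)) for x, y in examples) / len(examples)

examples = [(0, "like"), (0, "like"), (1, "nah"), (1, "like")]
f = lambda x: "like" if x == 0 else "nah"       # toy predictor
zero_one = lambda y, y_hat: 0 if y == y_hat else 1
print(training_error(examples, f, zero_one))    # 1 mistake out of 4
```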
Empirical Error
• Alpaydin (2010: 24): the empirical error is the proportion of training instances where the predictions of h (the hypothesis = the informed guess) do not match the required values given in X (the training set). The error of the hypothesis h given the training set X is:

E(h|X) = (1/N) Σ_{t=1..N} 1(h(x^t) ≠ r^t)
Induction
Given:
• a loss function l
and
• a sample d from some unknown distribution D
• you must compute a function f that has low expected error ε over D with respect to l.
Quiz 1: Training error
• How would you define the training error on a dataset?
1. Training error is the average loss over the training sample
2. Training error is the expected prediction error over an independent test sample
3. None of the above
Quiz 2: Distributions
What kind of distribution is D in the formula above?
1. Normal
2. Unknown
3. None of the above
Quiz 3: Loss function
• How would you define a loss function?
1. The loss function L(actual value, predicted value) characterizes how bad predictions are
2. The loss function is an unknown distribution
3. Both definitions are incorrect.
The End