Midterm Review
CS 6375: Machine Learning
Vibhav Gogate, The University of Texas at Dallas
Page 1:

Midterm Review CS 6375: Machine Learning

Vibhav Gogate

The University of Texas at Dallas

Page 2:

Machine Learning

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Supervised learning, by output type and model class:

• Parametric, Y discrete
  – Decision Trees: greedy search; pruning
  – Probability of class | features: 1. Learn P(Y), P(X|Y); apply Bayes rule  2. Learn P(Y|X) with gradient descent
  – Non-probabilistic: Linear: perceptron, gradient descent; Nonlinear: neural net, backprop; Support vector machines
• Parametric, Y continuous
  – Linear Functions: 1. Learned in closed form  2. Using gradient descent
  – Gaussians: Learned in closed form
• Non-parametric, Y discrete
  – Nearest Neighbor methods
• Non-parametric, Y continuous
  – Locally weighted regression

Page 3:
Page 4:
Page 5:

Key Perspective on Learning

• Learning as Optimization

– Closed form

– Greedy search

– Gradient ascent

• Loss Function

– Error + regularization

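To make the "learning as optimization" view concrete, here is a minimal gradient-descent sketch (my own illustration, not from the slides) minimizing a squared-error loss plus an L2 regularizer:

import numpy as np

def gradient_descent(X, y, lam=0.1, lr=0.01, steps=1000):
    """Minimize Loss(w) = ||Xw - y||^2 + lam * ||w||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) + 2 * lam * w   # gradient of (error + regularization)
        w = w - lr * grad                            # step in the downhill direction
    return w

# The same objective also admits a closed-form solution (the other route on this slide):
#   w* = (X^T X + lam * I)^{-1} X^T y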

Page 6:
Page 7:
Page 8:

What you should know about Decision Tree Learning

• Heuristics for selecting the next attribute
  – Information gain, one-step lookahead, gain ratio (the information-gain formula is written out after this list)
  – What makes the heuristic good?
  – What are its cons?
  – Complexity analysis
  – Sample exam question: if I tweak the selection heuristic, how will that change the complexity and quality?

• What kind of functions can it learn?
• Overfitting and pruning
• Handling missing data
• Handling continuous attributes
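For reference, the information-gain heuristic mentioned above is usually written as follows (standard definitions; the formula itself does not appear on this slide):

H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)

where p_c is the fraction of examples in S with class c and S_v is the subset of S with attribute A = v.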

Page 9:

• Noise
• Small number of examples associated with each leaf
  – What if only one example is associated with a leaf? Can you believe it?
• Coincidental regularities

Page 10:

Probability Theory

• Be able to apply and understand

– Axioms of probability

– Distribution vs density

– Conditional probability

– Sum-rule, chain-rule

– Bayes rule

• Sample question: If you know P(A|B), do you have enough information to compute P(B|A)?
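One way to reason about this sample question (worked out here; the slide does not give an answer): by Bayes rule,

P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}

so P(A|B) alone is not enough; you also need P(B) and P(A) (or P(A|¬B), from which P(A) follows by the sum rule).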

Page 11:

Maximum Likelihood Estimation

• Data: Observed set D of H Heads and T Tails

• Hypothesis: Binomial distribution

• Learning: finding θ is an optimization problem

– What’s the objective function?

• MLE: Choose θ to maximize the probability of D

Page 12:

How to get a closed form solution?

• Set derivative to zero, and solve!
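The equations themselves did not survive extraction; as a hedged reconstruction, the standard thumbtack MLE derivation is:

P(D \mid \theta) = \theta^{H}(1-\theta)^{T},
\qquad
\frac{d}{d\theta}\Big[H \ln\theta + T \ln(1-\theta)\Big]
= \frac{H}{\theta} - \frac{T}{1-\theta} = 0
\;\;\Rightarrow\;\;
\hat{\theta}_{\mathrm{MLE}} = \frac{H}{H+T}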

Page 13:

What if I have prior beliefs?

• Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?

• You say: I can learn it the Bayesian way…

• Rather than estimating a single θ, we obtain a distribution over possible values of θ

(Figure: the prior over θ "in the beginning"; we then observe flips, e.g. {tails, tails}, and obtain the posterior "after observations".)

Page 14:

Bayesian Learning

Use Bayes rule: the posterior over θ is the data likelihood times the prior, divided by a normalization constant.

Or equivalently: the posterior is proportional to (data likelihood) × (prior).

Also, for uniform priors, maximizing the posterior reduces to the MLE objective.
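Written out (a reconstruction of the equations that are images in the original slide, using the standard form):

P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\;\;\propto\;\;
P(D \mid \theta)\, P(\theta)

and with a uniform prior, \arg\max_\theta P(\theta \mid D) = \arg\max_\theta P(D \mid \theta), which is exactly the MLE objective.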

Page 15:

MAP: Maximum a Posteriori Approximation

• As more data is observed, the Beta posterior becomes more peaked (more certain)

• MAP: use most likely parameter to approximate the expectation
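For the thumbtack with a Beta prior, the standard result behind this slide (reconstructed here, not copied from it) is:

P(\theta) = \mathrm{Beta}(\beta_H, \beta_T)
\;\Rightarrow\;
P(\theta \mid D) \propto \theta^{H+\beta_H-1}(1-\theta)^{T+\beta_T-1} = \mathrm{Beta}(H+\beta_H,\; T+\beta_T),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \frac{H+\beta_H-1}{H+T+\beta_H+\beta_T-2}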

Page 16:

What you should know?

• MLE vs MAP and the relationship between the two

• MLE learning and Bayesian learning

– Thumbtack example

– Gaussians

Page 17:

The Naïve Bayes Classifier

• Given:

– Prior P(Y)

– n conditionally independent features X given the class Y

– For each Xi, we have likelihood P(Xi|Y)

• Decision rule:

(Graphical model: the class node Y with children X1, X2, …, Xn.)
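The decision rule itself is an image in the original slide; the standard form is:

y^{*} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)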

Page 18:

Subtleties of Naïve Bayes

• What is the hypothesis space?

• What kind of functions can it learn?

• When does it work and when does it not?

– Correlated features

• MLE vs Bayesian learning of Naïve Bayes

• Gaussian Naïve Bayes

Page 19:

Generative vs. Discriminative Classifiers

• Want to learn h: X → Y
  – X: features
  – Y: target classes

• Generative classifiers, e.g., Naïve Bayes:
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate the parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y|X = x): P(Y|X) ∝ P(X|Y) P(Y)
  – This is a 'generative' model
    • Indirect computation of P(Y|X) through Bayes rule
    • As a result, it can also generate a sample of the data, since P(X) = Σy P(y) P(X|y)

• Discriminative classifiers, e.g., Logistic Regression:
  – Assume some functional form for P(Y|X)
  – Estimate the parameters of P(Y|X) directly from training data
  – This is the 'discriminative' model
    • Directly learn P(Y|X)
    • But cannot obtain a sample of the data, because P(X) is not available
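To make the generative/discriminative contrast concrete, here is a small sketch (my own illustration, not from the slides; it assumes scikit-learn and NumPy are available):

import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels for illustration only

gnb = GaussianNB().fit(X, y)               # learns class priors and class-conditional Gaussians
lr = LogisticRegression().fit(X, y)        # fits P(Y|X) by maximizing conditional likelihood

x_new = np.array([[0.5, -0.2]])
print(gnb.predict_proba(x_new))   # P(Y|X) obtained indirectly via Bayes rule
print(lr.predict_proba(x_new))    # P(Y|X) modeled directly; P(X) is never estimated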

Page 20:

Linear Regression

h_w(x) = w_1 x + w_0,
\qquad
\mathbf{w} = \arg\min_{\mathbf{w}} \mathrm{Loss}(h_w)

w_1 = \frac{N \sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N \sum_j x_j^2 - \big(\sum_j x_j\big)^2},
\qquad
w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}
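A minimal sketch of these closed-form estimates (my own illustration; the function and variable names are assumptions):

import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit of h_w(x) = w1*x + w0 (single feature)."""
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w1, w0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w1, w0 = fit_simple_linear_regression(x, y)   # roughly w1 ≈ 2, w0 ≈ 0 on this toy data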

Page 21:

Logistic Regression

Learn P(Y|X) directly!

Assume a particular functional form

Not differentiable…

(Figure: a hard 0/1 threshold for P(Y = 1 | X) vs. P(Y = 0 | X), i.e., a step function.)

Page 22:

Logistic Regression

Learn P(Y|X) directly!

Assume a particular functional form

Logistic Function

Aka Sigmoid

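The functional form being assumed (the slide's equation is an image; this is the standard logistic/sigmoid model):

P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)},
\qquad
P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X)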

Page 23:

Issues in Linear and Logistic Regression

• Overfitting avoidance: Regularization

– L1 vs L2 regularization

Page 24:

What you should know about Logistic Regression (LR)

• Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
  – The solutions differ because of the objective (loss) function

• In general, NB and LR make different assumptions
  – NB: features independent given the class => an assumption on P(X|Y)
  – LR: functional form of P(Y|X), no assumption on P(X|Y)

• LR is a linear classifier
  – the decision rule is a hyperplane

• LR is optimized by conditional likelihood
  – no closed-form solution
  – concave => global optimum with gradient ascent
  – maximum conditional a posteriori corresponds to regularization

• Convergence rates
  – GNB (usually) needs less data
  – LR (usually) gets to better solutions in the limit
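The conditional-likelihood objective and its gradient (standard forms, reconstructed rather than copied from the slides):

\ell(\mathbf{w}) = \sum_j \Big[ y^j \big(w_0 + \textstyle\sum_i w_i x_i^j\big) - \ln\!\big(1 + \exp(w_0 + \textstyle\sum_i w_i x_i^j)\big) \Big],
\qquad
\frac{\partial \ell}{\partial w_i} = \sum_j x_i^j \big( y^j - P(Y = 1 \mid \mathbf{x}^j, \mathbf{w}) \big)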

Page 25:
Page 26:

From Logistic Regression to the Perceptron: 2 easy steps!

• Logistic Regression (in vector notation): y is {0, 1}

• Perceptron: y is {0,1}, y(x;w) is prediction given w

Differences?

• Drop the Σj over training examples: online vs. batch learning

• Drop the distribution: probabilistic vs. error-driven learning
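A minimal sketch of the resulting perceptron update (my own illustration; it assumes a constant-1 feature has been appended to each x for the bias):

import numpy as np

def perceptron_epoch(X, y, w, lr=1.0):
    """One online pass of the perceptron: error-driven updates, y in {0, 1}.
    The standard rule w += lr * (y - y_hat) * x changes w only when the prediction is wrong."""
    for x_j, y_j in zip(X, y):
        y_hat = 1 if np.dot(w, x_j) >= 0 else 0   # hard-threshold prediction, no sigmoid
        w = w + lr * (y_j - y_hat) * x_j          # no update when y_hat == y_j
    return w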

Page 27:

Properties of Perceptrons

• Separability: some parameters get the training set perfectly correct

• Convergence: if the training data is separable, the perceptron will eventually converge (binary case)

(Figures: a linearly separable dataset vs. a non-separable one.)

Page 28:

Problems with the Perceptron

• Noise: if the data isn't separable, weights might thrash
  – Averaging weight vectors over time can help (averaged perceptron)

• Mediocre generalization: finds a "barely" separating solution

• Overtraining: test / validation accuracy usually rises, then falls
  – Overtraining is a kind of overfitting

Page 29:
Page 30:
Page 31:

Neural networks: What you should know?

• How does it learn non-linear functions?

• Can it learn, for example, an XOR function?
  – Draw a neural network for it with appropriate weights (a worked sketch follows after this list)

• Backprop

• Overfitting

• What kind of functions can it learn?

• Tradeoff
  – number of hidden units
  – number of layers
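One concrete answer to the XOR question above, as a sketch with hand-picked weights and step activations (my own example, not taken from the slides):

import numpy as np

def step(z):
    return (z >= 0).astype(float)

def xor_net(x1, x2):
    """Two-layer network computing XOR: hidden units compute OR and NAND, the output is their AND."""
    x = np.array([x1, x2])
    h_or = step(np.dot([1.0, 1.0], x) - 0.5)       # fires if x1 OR x2
    h_nand = step(np.dot([-1.0, -1.0], x) + 1.5)   # fires unless both inputs are 1
    return step(1.0 * h_or + 1.0 * h_nand - 1.5)   # AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0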

Page 32:

Linear SVM

• Aim: Learn a large margin classifier

• Mathematical Formulation:

(Figure: training points in the (x1, x2) plane; one marker denotes +1, the other denotes -1; x+ and x- are points on the margin boundaries.)

\text{maximize } \frac{2}{\lVert \mathbf{w} \rVert}
\quad \text{such that} \quad
\begin{cases}
\mathbf{w}^{T}\mathbf{x}_i + b \ge +1 & \text{for } y_i = +1 \\
\mathbf{w}^{T}\mathbf{x}_i + b \le -1 & \text{for } y_i = -1
\end{cases}

Common theme in machine learning:

LEARNING IS OPTIMIZATION

Page 33:

Solving the Optimization Problem

\text{minimize } L_p(\mathbf{w}, b, \alpha_i)
= \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
- \sum_{i=1}^{n} \alpha_i \Big[ y_i\big(\mathbf{w}^{T}\mathbf{x}_i + b\big) - 1 \Big]
\quad \text{s.t. } \alpha_i \ge 0

Lagrangian dual problem:

\text{maximize } \sum_{i=1}^{n} \alpha_i
- \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j
\quad \text{s.t. } \alpha_i \ge 0 \;\text{ and }\; \sum_{i=1}^{n} \alpha_i y_i = 0

Page 34:

Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Page 35:

Nonlinear SVMs: The Kernel Trick

With this mapping, our discriminant function is now:

g(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b
= \sum_{i \in SV} \alpha_i y_i\, \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}) + b

No need to know this mapping explicitly, because we only use the dot product of feature vectors in both the training and test.

A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)

Page 36:

Nonlinear SVMs: The Kernel Trick

Examples of commonly used kernel functions:

Linear kernel: K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T}\mathbf{x}_j

Polynomial kernel: K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^{T}\mathbf{x}_j)^{p}

Gaussian (Radial Basis Function, RBF) kernel: K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\sigma^{2}} \right)

Sigmoid: K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta_0\, \mathbf{x}_i^{T}\mathbf{x}_j + \beta_1)

In general, functions that satisfy Mercer’s condition can be kernel functions: Kernel matrix should be positive semidefinite.
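As a rough illustration of how such kernels are used in practice (my own sketch, not from the slides; it assumes scikit-learn, whose SVC parameterizes the RBF kernel with gamma rather than sigma):

import numpy as np
from sklearn.svm import SVC   # soft-margin SVM; solves the dual problem internally

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in the input space

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)   # RBF kernel: exp(-gamma * ||xi - xj||^2)
print(clf.support_.shape)          # indices of the support vectors (the i with alpha_i > 0)
print(clf.predict([[0.2, 0.1]]))   # a point inside the unit circle -> class 0 (typically)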

Page 37:

K-nearest Neighbor

• Distance measure

– Most common: Euclidean

• Choosing k

– Increasing k reduces variance, increases bias

• In high-dimensional spaces, the problem is that the nearest neighbor may not be very close at all!

• Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.
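A minimal k-NN sketch reflecting the points above (my own illustration; Euclidean distance, majority vote, and a full pass over the training data per query; X_train and y_train are assumed to be NumPy arrays):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # scan the entire training set
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]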

Page 38:

Nearest Neighbor

• Advantages
  – variable-sized hypothesis space
  – learning is extremely efficient
    • however, growing a good kd-tree can be expensive
  – very flexible decision boundaries

• Disadvantages
  – the distance function must be carefully chosen
  – irrelevant or correlated features must be eliminated
  – typically cannot handle more than 30 features
  – computational costs: memory and classification-time computation

Page 39:

Locally Weighted Linear Regression: LWLR

• Idea:
  – k-NN forms a local approximation to the target function for each query point xq
  – Why not form an explicit approximation 𝑓 for the region surrounding xq?
    • Fit a linear function to the k nearest neighbors
    • Fit a quadratic, ...
    • Thus producing a "piecewise approximation" to 𝑓
  – Minimize error over the k nearest neighbors of xq
  – Minimize error over the entire set of examples, weighting by distance
  – Combine the two above
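A sketch of the distance-weighted variant (my own illustration; Gaussian weights with an assumed bandwidth tau, solved by weighted least squares):

import numpy as np

def lwlr_predict(X, y, x_q, tau=0.5):
    """Locally weighted linear regression at query point x_q.
    Each training point is weighted by a Gaussian kernel of its distance to x_q,
    then a weighted least-squares fit is computed and evaluated at x_q."""
    Xb = np.hstack([np.ones((len(X), 1)), X])                        # add intercept column
    xq = np.concatenate([[1.0], np.atleast_1d(x_q)])
    w = np.exp(-np.sum((X - x_q) ** 2, axis=1) / (2 * tau ** 2))     # distance-based weights
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)             # weighted normal equations
    return xq @ theta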

Page 40:

Bias-Variance-Noise Decomposition

Let \bar{h}(x^*) = E[h(x^*)] denote the average prediction over training sets, f the true function, and y^* = f(x^*) + \varepsilon a noisy observation. Then

E\big[(h(x^*) - y^*)^2\big]
= E\big[h(x^*)^2\big] - 2\,E[h(x^*)]\,E[y^*] + E\big[(y^*)^2\big]
= E\big[(h(x^*) - \bar{h}(x^*))^2\big] + \bar{h}(x^*)^2 - 2\,\bar{h}(x^*)\,f(x^*) + E\big[(y^* - f(x^*))^2\big] + f(x^*)^2
= \underbrace{E\big[(h(x^*) - \bar{h}(x^*))^2\big]}_{\text{VARIANCE}}
+ \underbrace{\big(\bar{h}(x^*) - f(x^*)\big)^2}_{\text{BIAS}^2}
+ \underbrace{E\big[(y^* - f(x^*))^2\big]}_{\text{NOISE}}
= \mathrm{Var}(h(x^*)) + \mathrm{Bias}(h(x^*))^2 + E[\varepsilon^2]
= \mathrm{Var}(h(x^*)) + \mathrm{Bias}(h(x^*))^2 + \sigma^2

Expected prediction error = Variance + Bias² + Noise²

Page 41:

Bias, Variance, and Noise

• Variance: E[(h(x*) − E[h(x*)])²]
  Describes how much h(x*) varies from one training set S to another

• Bias: E[h(x*)] − f(x*)
  Describes the average error of h(x*)

• Noise: E[(y* − f(x*))²] = E[ε²] = σ²
  Describes how much y* varies from f(x*)

Page 42:

Bias/Variance Tradeoff

• (bias² + variance) is what counts for prediction

• Often:
  – low bias => high variance (too many parameters)
  – low variance => high bias (too few parameters)

• Tradeoff:
  – bias² vs. variance

Page 43:


Bagging: Bootstrap Aggregation

• Leo Breiman (1994)

• Take repeated bootstrap samples from training set D.

• Bootstrap sampling: Given set D containing N training examples, create D’ by drawing N examples at random with replacement from D.

• Bagging:
  – Create k bootstrap samples D1 … Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average.
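A minimal sketch of the procedure (my own illustration; the choice of decision trees as the base classifier and the scikit-learn dependency are assumptions, and labels are assumed to be small non-negative integers):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y) drawn with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    N = len(X)
    for _ in range(k):
        idx = rng.integers(0, N, size=N)   # bootstrap sample D_i: N draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x_new):
    """Classify a new instance (1-D NumPy array) by majority vote over the k classifiers."""
    votes = np.array([int(m.predict(x_new.reshape(1, -1))[0]) for m in models])
    return np.bincount(votes).argmax()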

Page 44: