Introduction to Statistical Modeling and Machine Learning
Lecture 8
Spoken Language Processing
Prof. Andrew Rosenberg


What is Statistical Modeling?

• Statistical Modeling is the process of using data to construct a mathematical or algorithmic device to measure the probability of some observation.

• Training – using a set of observations to learn the parameters of a model, or to construct the decision-making process.

• Evaluation – determining the probability of a new observation.


What is a Statistical Model?

• Mathematically, it’s a function that maps observations to probabilities.

• Observations can be in
  – one dimension: one number (numeric) or one category (nominal)
  – or in many dimensions: two numbers (height and weight), or a number and a category (height and gender)

• Each dimension is called a feature.


What is Machine Learning?

• Automatically identifying patterns in data

• Automatically making decisions based on data

• Hypothesis: Data → Learning Algorithm → Behavior

  (rather than: Data → Programmer or Expert → Behavior)


Basics of Probabilities

• Probabilities fall in the range [0, 1].

• Mutually exclusive events are events that cannot simultaneously occur.
  – The probabilities of a set of mutually exclusive events that covers every outcome must sum to 1.


Joint Probability

• We can represent the probability of more than one event at the same time.

• If two events are independent, their joint probability is the product of the individual probabilities: p(X, Y) = p(X) p(Y).


Joint Probability Table

• A Joint Probability function defines the likelihood of two (or more) events occurring.

• Let n_ij be the number of times event i and event j simultaneously occur.

              Orange   Green   Total
  Blue box       1       3       4
  Red box        6       2       8
  Total          7       5      12
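With N = 12 total draws, each count n_ij divided by N estimates a joint probability. For example, taking the box as one event and the ball color as the other:

p(\text{blue box}, \text{orange}) = \frac{n_{ij}}{N} = \frac{1}{12}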


Marginalization

• Consider the probability of X irrespective of Y.

• The number of instances in column j is the sum of the instances in each cell of that column.

• Therefore, we can marginalize, or "sum over", Y, as shown below:
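In standard form, with a worked value from the table above:

p(X = x_j) = \sum_i p(X = x_j, Y = y_i), \qquad \text{e.g. } p(\text{blue box}) = \frac{1 + 3}{12} = \frac{1}{3}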


Conditional Probability

• Consider only instances where X = x_j.

• The fraction of these instances where Y = y_i is the conditional probability: "the probability of y given x".
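In standard form, with a worked value from the table above (the blue box holds 4 balls, 1 of them orange):

p(Y = y_i \mid X = x_j) = \frac{n_{ij}}{\sum_k n_{kj}}, \qquad \text{e.g. } p(\text{orange} \mid \text{blue box}) = \frac{1}{4}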


Relating the Joint, Conditional, and Marginal
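The standard identities tying these together:

p(X, Y) = p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y), \qquad p(X) = \sum_Y p(X, Y)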


Sum and Product Rules

• In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x).

Sum Rule: p(X) = \sum_Y p(X, Y)

Product Rule: p(X, Y) = p(Y \mid X)\, p(X)


Bayes Rule
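In standard form:

p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_Y p(X \mid Y)\, p(Y)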


Interpretation of Bayes Rule

• Prior: Information we have before observation.

• Posterior: The distribution of Y after observing X.

• Likelihood: The likelihood of observing X given Y.

posterior ∝ likelihood × prior


Expected Values

• The expected value of a random variable is a weighted average.

• Expected values are used to determine what is likely to happen in a random setting.

• Expectation – the expected value of a function is the hypothesis.

• Variance – the variance is the confidence in that hypothesis.
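For a discrete random variable, the standard definitions are:

\mathbb{E}[f] = \sum_x p(x)\, f(x), \qquad \operatorname{var}[f] = \mathbb{E}\!\left[(f(x) - \mathbb{E}[f])^2\right]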


What is a Probability?

• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the number of total events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.


What is a Probability?

• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment.
  – If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.


Boxes and Balls

• 2 boxes, one red and one blue.

• Each contains colored balls.


Boxes and Balls

• Given some information about the box (B) and the ball color (L), we want to ask questions about the likelihood of different events.

• What is the probability of selecting a green ball?

• If I chose an orange ball, what is the probability that I chose from the blue box?
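A worked answer to the second question, assuming draws follow the counts in the joint probability table above:

p(\text{blue box} \mid \text{orange}) = \frac{p(\text{orange} \mid \text{blue box})\, p(\text{blue box})}{p(\text{orange})} = \frac{(1/4)(4/12)}{7/12} = \frac{1}{7}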


Naïve Bayes Classification

• The boxes-and-balls problem is a simple case of a basic classification approach.

• Here the Box is the class, and the colored ball is a feature, or the observation.

• We can extend this Bayesian classification approach to incorporate more independent features.


• Assuming independence between the features given the class simplifies the math
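Under this assumption, the classifier takes the standard naïve Bayes form, for features x_1, …, x_d and class y:

\hat{y} = \arg\max_y\; p(y) \prod_{i=1}^{d} p(x_i \mid y)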


Argmax

• Identify the parameter that maximizes a function.

• When training a model, the goal is to maximize the likelihood of the model under some parameters.

• Since the log function is monotonic, optimizing a log transform of the likelihood is equivalent.
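In symbols, assuming independent, identically distributed observations x_1, …, x_n:

\hat{\theta} = \arg\max_\theta\; p(x_1, \ldots, x_n \mid \theta) = \arg\max_\theta\; \sum_{i=1}^{n} \log p(x_i \mid \theta)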


Bernoulli Distribution

• Also known as a Binary Distribution.

• Represented by a single parameter, b.

• A constrained version of the more general multinomial distribution.

  x = 1 with probability b (e.g., 0.72); x = 0 with probability 1 − b (e.g., 0.28)
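The standard probability mass function, writing the parameter as b:

p(x \mid b) = b^{x} (1 - b)^{1 - x}, \qquad x \in \{0, 1\}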


Multinomial Distribution

• If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.

• The probability of x being in state k is μ_k.

  Example with K = 5 states: μ = (0.1, 0.1, 0.5, 0.2, 0.1)
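With 1-of-K coding (x_k = 1 for the active state and 0 otherwise), the standard form is:

p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1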


Gaussian Distribution

• One Dimension

• D-Dimensions
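The standard densities for the two cases named above:

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)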

Gaussian Distributions

• We use Gaussian Distributions all over the place.

[Figure sequence: example Gaussian distributions.]


Supervised vs. Unsupervised Learning

• In supervised learning, the desired, target, or class value is known.

• In unsupervised learning, there are no observations of the target variable.

• Major Tasks
  – Regression: predict a numerical value from features, i.e., "other information".
  – Classification: predict a categorical value.
  – Clustering: identify groups of similar entities.

Graphical Example of Regression

[Figure sequence: a regression example; the value to predict is marked "?".]

Graphical Example of Classification

[Figure sequence: a classification example; query points are marked "?".]

Decision Boundaries

[Figure: decision boundaries for the classification example.]

Graphical Example of Clustering

[Figure sequence: a clustering example.]


Counting parameters

• The "size" of a statistical model is measured by the number of parameters that need to be trained.

• Bernoulli distribution – one parameter.

• Multinomial distribution – N−1 parameters.

• 1-dimensional Gaussian – 2 parameters: mean and variance.

• N-dimensional Gaussian – an N-dimensional mean vector and an N×N covariance matrix (N(N+1)/2 free covariance parameters, since the matrix is symmetric).


Curse of Dimensionality

• Increasing the number of features increases data needs exponentially.

• If 1 feature can be approximated with 10 observations, 2 features require 10 × 10, and d features on the order of 10^d.

• Models should be "small" – few parameters / features – relative to the amount of available data.


Overfitting

• Models with more parameters are more general – i.e., they can represent more relationships between variables.

• More parameters can allow a statistical model to fit the training data too well.

• Too well: when the model fails to generalize to unseen data.

[Figure sequence: overfitting examples.]


Evaluation of Statistical Models

• Model Likelihood: calculate p(x; Θ) of new data x based on the trained parameters Θ.

• The model parameters (almost always) maximize the likelihood of the training data.

• So evaluate the likelihood of unseen – evaluation or testing – data, as in the sketch below.
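A minimal sketch of this procedure in Python, using synthetic stand-in data and a 1-D Gaussian model (chosen for illustration; not the lecture's code):

```python
import numpy as np
from scipy.stats import norm

# Synthetic stand-in data; any train/test split of 1-D observations works.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=1000)
test = rng.normal(loc=5.0, scale=2.0, size=200)

# Maximum-likelihood parameters of a 1-D Gaussian: the sample mean
# and the (biased, ddof=0) sample standard deviation.
mu_hat, sigma_hat = train.mean(), train.std()

# Average log-likelihood of unseen data under the trained parameters.
held_out_ll = norm.logpdf(test, loc=mu_hat, scale=sigma_hat).mean()
print(f"average held-out log-likelihood: {held_out_ll:.3f}")
```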


Evaluation of Statistical Models

• Evaluating Classifiers

• Accuracy is the most common and most intuitive measure of a classifier's performance.
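In terms of the contingency-table counts defined on the next slide:

\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}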


Contingency Table

• Reports the confusion between true and hypothesized classes.

                           True Values
                       Positive           Negative
  Hyp      Positive    True Positive      False Positive
  Values   Negative    False Negative     True Negative


Cross Validation

• Cross Validation is a technique to estimate the generalization performance of a classifier.

• Identify n "folds" of the available data.

• Train on n − 1 folds.

• Test on the remaining fold.

• In the extreme (n = N) this is known as "leave-one-out" cross validation.

• n-fold cross validation (xval) gives n samples of the performance of the classifier, as in the sketch below.
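A minimal Python sketch of the fold logic. X and y are assumed to be NumPy arrays, and train_and_score is a hypothetical callback standing in for whatever classifier is being evaluated:

```python
import numpy as np

def n_fold_cv(X, y, n, train_and_score):
    """n-fold cross validation: returns n performance samples.

    train_and_score(X_tr, y_tr, X_te, y_te) trains a classifier on the
    training fold and returns its score (e.g., accuracy) on the test fold.
    """
    indices = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(indices, n)  # n roughly equal folds
    scores = []
    for i in range(n):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return scores
```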


Caveats – Black Swans

• In the 17th century, all known swans were white.

• Based on the available evidence, it seemed impossible for a swan to be anything other than white.

• In the 18th century, black swans were discovered in Western Australia.

• Black Swans are rare, sometimes unpredictable events that have extreme impact.

• Almost all statistical models underestimate the likelihood of unseen events.


Caveats – The Long Tail

• Many events follow an exponential distribution.

• These distributions have a very long "tail" – i.e., a large region with significant total probability mass but low likelihood at any particular point.

• Often, interesting events occur in the Long Tail, but it is difficult to accurately model behavior in this region.


Next Class

• Gaussian Mixture Models

• Reading: J&M 9.3