Introduction to Statistical Modeling and Machine Learning Lecture 8 Spoken Language Processing Prof. Andrew Rosenberg
What is Statistical Modeling?
• Statistical Modeling is the process of using data to construct a mathematical or algorithmic device that measures the probability of some observation.
• Training – using a set of observations to learn the parameters of a model, or to construct the decision-making process.
• Evaluation – determining the probability of a new observation.
What is a Statistical Model?
• Mathematically, it’s a function that maps observations to probabilities.
• Observations can be in
  – one dimension
    • one number (numeric), one category (nominal)
  – or in many dimensions
    • two numbers: height and weight
    • a number and a category: height and gender
• Each dimension is called a feature
What is Machine Learning?
• Automatically identifying patterns in data
• Automatically making decisions based on data
• Hypothesis:
  Data → Learning Algorithm → Behavior
    ≥
  Data → Programmer or Expert → Behavior
Basics of Probabilities
• Probabilities fall in the range [0, 1].
• Mutually exclusive events are events that cannot simultaneously occur.
  – The sum of the likelihoods of all mutually exclusive events must be 1.
Joint Probability
• We can represent the probability of more than one event at the same time.
• If two events are independent, their joint probability is the product of the individual probabilities: p(X, Y) = p(X)p(Y).
Joint Probability Table
• A Joint Probability function defines the likelihood of two (or more) events occurring.
• Let nij be the number of times event i and event j simultaneously occur.
           Orange  Green  Total
Blue box      1      3      4
Red box       6      2      8
Total         7      5     12
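The joint probabilities can be read directly off the table by dividing each count by the total N = 12. A minimal sketch in Python (box and color labels taken from the table):

```python
# Counts n_ij from the table: rows are boxes, columns are ball colors.
counts = {
    ("blue", "orange"): 1, ("blue", "green"): 3,
    ("red", "orange"): 6, ("red", "green"): 2,
}
N = sum(counts.values())  # total number of instances: 12

# Joint probability p(box, color) = n_ij / N
joint = {event: n / N for event, n in counts.items()}

print(joint[("red", "orange")])  # 6/12 = 0.5
```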
Marginalization
• Consider the probability of X irrespective of Y.
• The number of instances in column j is the sum of the instances in each cell of that column.
• Therefore, we can marginalize or “sum over” Y: p(X = xj) = Σi p(X = xj, Y = yi)
Conditional Probability
• Consider only the instances where X = xj.
• The fraction of these instances where Y = yi is the conditional probability – “the probability of y given x”: p(Y = yi | X = xj) = p(X = xj, Y = yi) / p(X = xj)
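Both operations are simple arithmetic over the joint probability table. A sketch, using the box-and-ball probabilities from the earlier table:

```python
# Joint probabilities from the box-and-ball table (N = 12).
joint = {
    ("blue", "orange"): 1/12, ("blue", "green"): 3/12,
    ("red", "orange"): 6/12, ("red", "green"): 2/12,
}

def marginal_color(color):
    """p(color): sum the joint over all boxes (marginalizing out the box)."""
    return sum(p for (box, c), p in joint.items() if c == color)

def conditional_box_given_color(box, color):
    """p(box | color) = p(box, color) / p(color)."""
    return joint[(box, color)] / marginal_color(color)

print(marginal_color("orange"))                       # 7/12
print(conditional_box_given_color("blue", "orange"))  # (1/12) / (7/12) = 1/7
```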
Sum and Product Rules
• In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x).
Sum Rule: p(X) = ΣY p(X, Y)
Product Rule: p(X, Y) = p(Y | X) p(X)
Interpretation of Bayes Rule
• Bayes Rule: p(Y | X) = p(X | Y) p(Y) / p(X)
• Prior p(Y): information we have before observation.
• Posterior p(Y | X): the distribution of Y after observing X.
• Likelihood p(X | Y): the likelihood of observing X given Y.
Expected Values
• The expected value of a random variable is a weighted average: E[X] = Σx x p(x).
• Expected values are used to determine what is likely to happen in a random setting.
• Expectation – the expected value of a function is the hypothesis: E[f] = Σx f(x) p(x).
• Variance – the variance is the confidence in that hypothesis: var[f] = E[(f(x) − E[f(x)])^2].
What is a Probability?
• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the total number of events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.
What is a Probability?
• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment.
  – If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
Boxes and Balls
• Given some information about the box (B) and the ball (L), we want to ask questions about the likelihood of different events.
• What is the probability of selecting a green ball?
• If I chose an orange ball, what is the probability that I chose from the blue box?
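Both questions can be answered from the counts in the joint probability table, using the sum rule for the first and Bayes rule for the second. A sketch:

```python
# Counts from the joint probability table (blue/red boxes, orange/green balls).
n = {("blue", "orange"): 1, ("blue", "green"): 3,
     ("red", "orange"): 6, ("red", "green"): 2}
N = sum(n.values())  # 12

# Sum rule: p(green) = sum over boxes of p(box, green)
p_green = sum(c for (box, color), c in n.items() if color == "green") / N  # 5/12

# Bayes rule: p(blue | orange) = p(orange | blue) * p(blue) / p(orange)
n_blue = n[("blue", "orange")] + n[("blue", "green")]
p_blue = n_blue / N                                   # 4/12
p_orange_given_blue = n[("blue", "orange")] / n_blue  # 1/4
p_orange = sum(c for (box, color), c in n.items() if color == "orange") / N  # 7/12
p_blue_given_orange = p_orange_given_blue * p_blue / p_orange  # 1/7
```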
Naïve Bayes Classification
• This is a simple case of a common classification approach.
• Here the Box is the class, and the colored ball is a feature, or the observation.
• We can extend this Bayesian classification approach to incorporate more independent features.
Naïve Bayes Classification
• Assuming independence between the features given the class simplifies the math: p(x1, …, xn | C) = p(x1 | C) ··· p(xn | C), so we classify by argmaxC p(C) Πi p(xi | C).
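A minimal count-based sketch of this classifier. The training pairs below (ball color plus a second, hypothetical "size" feature, with the box as the class) are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

# Toy training data: (features, class) pairs. Feature values are illustrative.
data = [(("orange", "small"), "blue"), (("green", "small"), "blue"),
        (("orange", "large"), "red"), (("orange", "large"), "red")]

class_counts = Counter(c for _, c in data)
feat_counts = defaultdict(Counter)  # feat_counts[(feature_index, class)][value]
for feats, c in data:
    for i, v in enumerate(feats):
        feat_counts[(i, c)][v] += 1

def predict(feats):
    """argmax_c p(c) * prod_i p(x_i | c), with all probabilities from counts."""
    def score(c):
        p = class_counts[c] / len(data)          # class prior p(c)
        for i, v in enumerate(feats):
            p *= feat_counts[(i, c)][v] / class_counts[c]  # p(x_i | c)
        return p
    return max(class_counts, key=score)

print(predict(("orange", "large")))  # "red"
```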
Argmax
• Identify the parameter that maximizes a function.
• When training a model, the goal is to maximize the likelihood of the model under some parameters.
• Since the log function is monotonic, optimizing a log transform of the likelihood is equivalent.
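Because the log is monotonic, the parameter that maximizes the likelihood also maximizes the log-likelihood. A sketch with a Bernoulli likelihood over a grid of candidate parameters (the observed counts are illustrative):

```python
import math

# Illustrative observations: 7 heads, 3 tails.
heads, tails = 7, 3
candidates = [i / 100 for i in range(1, 100)]  # candidate values of b in (0, 1)

def likelihood(b):
    return b**heads * (1 - b)**tails

def log_likelihood(b):
    return heads * math.log(b) + tails * math.log(1 - b)

b_ml = max(candidates, key=likelihood)
b_ml_log = max(candidates, key=log_likelihood)
print(b_ml, b_ml_log)  # both 0.7: the argmax is unchanged by the log transform
```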
Bernoulli Distribution
• Also known as a Binary Distribution.
• Represented by a single parameter b: p(x = 1) = b and p(x = 0) = 1 − b (e.g. b = 0.72, 1 − b = 0.28).
• A constrained version of the more general multinomial distribution.
Multinomial Distribution
• If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.
• The probability of x being in state k is μk, with Σk μk = 1 – e.g. μ = (0.1, 0.1, 0.5, 0.2, 0.1) for K = 5.
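The maximum-likelihood estimate of each μk is just the fraction of observations that fall in state k. A sketch (the observation list is illustrative, chosen to reproduce the example distribution above):

```python
from collections import Counter

# Observed states drawn from a K=5 multinomial (illustrative data).
observations = [2, 2, 2, 2, 2, 0, 3, 3, 4, 1]
N = len(observations)

counts = Counter(observations)
mu = [counts[k] / N for k in range(5)]  # mu_k = N_k / N
print(mu)  # [0.1, 0.1, 0.5, 0.2, 0.1]
```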
Supervised vs. Unsupervised Learning
• In supervised learning, the desired, target, or class value is known.
• In unsupervised learning, there are no observations of the target variable.
• Major Tasks
  – Regression
    • Predict a numerical value from features, i.e. “other information”
  – Classification
    • Predict a categorical value
  – Clustering
    • Identify groups of similar entities
Counting parameters
• The “size” of a statistical model is measured by the number of parameters that need to be trained.
• Bernoulli distribution – one parameter
• Multinomial distribution – N − 1 parameters
• 1-dimensional Gaussian – 2 parameters: mean and variance
• N-dimensional Gaussian – an N-dimensional mean vector and an N×N covariance matrix
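The Gaussian count can be sketched as a small helper, following the slide's accounting of the full N×N covariance matrix:

```python
def gaussian_param_count(n_dims):
    """Parameters of an n-dimensional Gaussian, as counted on the slide:
    an n-dimensional mean vector plus an n x n covariance matrix.
    (Since the covariance matrix is symmetric, only n*(n+1)/2 of its
    entries are actually free parameters.)"""
    return n_dims + n_dims * n_dims

print(gaussian_param_count(1))  # 2: mean and variance
print(gaussian_param_count(3))  # 12: 3-dim mean plus 3x3 covariance
```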
Curse of Dimensionality
• Increased number of features increases data needs exponentially.
• If 1 feature can be approximated with 10 observations, 2 features require 10 × 10 = 100, and d features require 10^d.
• Models should be “small” – few parameters / features – relative to the amount of available data.
Overfitting
• Models with more parameters are more general, i.e., they can represent more relationships between variables.
• More parameters can allow a statistical model to fit the training data too well.
• “Too well”: when the model fails to generalize to unseen data.
Evaluation of Statistical Models
• Model Likelihood.
• Calculate p(x; Θ) of new data x based on the trained parameters Θ.
• The model parameters (almost always) maximize the likelihood of the training data.
• Evaluate the likelihood of unseen – evaluation or testing – data.
Evaluation of Statistical Models
• Evaluating Classifiers
• Accuracy – the fraction of hypothesized labels that are correct – is the most common and most intuitive measure of classifier performance.
Contingency Table
• Reports the confusion between True and Hypothesized classes
                         True Values
                    Positive          Negative
Hyp      Positive   True Positive     False Positive
Values   Negative   False Negative    True Negative
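The four cells of the table support the standard classifier metrics. A sketch computing accuracy (from the previous slide) plus precision and recall, which follow directly from the same cells; the counts below are illustrative:

```python
# Entries of the contingency (confusion) table; counts are illustrative.
tp, fp, fn, tn = 40, 10, 5, 45
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # fraction of hypotheses that are correct
precision = tp / (tp + fp)     # of hypothesized positives, how many are true
recall = tp / (tp + fn)        # of true positives, how many are found
print(accuracy, precision, recall)  # 0.85 0.8 0.888...
```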
Cross Validation
• Cross Validation is a technique to estimate the generalization performance of a classifier.
• Identify n “folds” of the available data.• Train on n-1 folds• Test on the remaining fold.• In the extreme (n=N) this is known as
“leave-one-out” cross validation• n-fold cross validation (xval) gives n samples
of the performance of the classifier.
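The fold-splitting step can be sketched as follows; the round-robin assignment of items to folds is one simple choice, not the only one:

```python
def n_fold_splits(data, n):
    """Partition data into n folds; yield (train, test) pairs,
    training on n-1 folds and testing on the held-out fold."""
    folds = [data[i::n] for i in range(n)]  # round-robin fold assignment
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
splits = list(n_fold_splits(data, 5))
print(len(splits))   # 5 performance samples
print(splits[0][1])  # first test fold: [0, 5]
```

With n = len(data), each test fold holds a single item and this reduces to leave-one-out cross validation.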
Caveats – Black Swans
• In the 17th Century, all known swans were white.
• Based on evidence, it is impossible for a swan to be anything other than white.
• In the 18th Century, black swans were discovered in Western Australia
• Black Swans are rare, sometimes unpredictable events that have extreme impact.
• Almost all statistical models underestimate the likelihood of unseen events.
Caveats – The Long Tail
• Many events follow an exponential distribution
• These distributions have a very long “tail”, i.e., a large region with significant total probability mass but low likelihood at any particular point.
• Often, interesting events occur in the Long Tail, but it is difficult to accurately model behavior in this region.