Transcript
Review of Lecture Two
• Linear Regression
  – Cost Function
  – Gradient Descent
• Normal Equation
  – $(X^T X)^{-1}$
• Probabilistic Interpretation
  – Maximum Likelihood Estimation vs. Linear Regression
  – Gaussian Distribution of the Data
• Generative vs. Discriminative
General Linear Regression Methods: Important Implications
• Recall that $\theta$, a column vector (one entry for the intercept $\theta_0$ plus $n$ parameters), can be obtained from:
• When the X variables are linearly independent ($X^T X$ is full rank), there is a unique solution to the normal equations;
• The inversion of $X^T X$ rests on the idea of a generalized inverse, a matrix $G$ satisfying $XGX = X$, that is, on finding a matrix equivalent of a numerical reciprocal;
• Only models with a single output variable can be trained.
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

$$\theta = (X^T X)^{-1} X^T y$$
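As a concrete illustration, here is a minimal sketch (not from the slides) of solving the normal equations with NumPy; the toy data and variable names are assumptions for the example:

```python
import numpy as np

# Toy design matrix: first column of ones for the intercept theta_0.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
y = np.array([1.0, 2.1, 3.9])

# theta = (X^T X)^{-1} X^T y. np.linalg.solve is preferred over an
# explicit inverse for numerical stability, and it fails (as the
# slide notes) when X^T X is not full rank.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [intercept, slope]
```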
Maximum Likelihood Estimation
• Assume the data are i.i.d. (independently and identically distributed)
  – The likelihood $L(\theta)$ = the probability of $y$ given $x$, parameterized by $\theta$
• What is Maximum Likelihood Estimation (MLE)?
  – Choose parameters $\theta$ to maximize the function $L(\theta)$, so as to make the training data set as probable as possible.
$$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

(the factorization into a product follows from the i.i.d. assumption above)
The Connection Between MLE and OLE (Ordinary Least-squares Estimation)
• Choose parameters $\theta$ to maximize the data likelihood:
$$L(\theta) = L(\theta; X, y)$$

• Equivalent to minimizing:
The Equivalence of MLE and OLE
$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta)\;\; !?$$
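A sketch of the step behind this equivalence, assuming $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with Gaussian noise $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Hence maximizing $\ell(\theta)$ is the same as minimizing $J(\theta)$, independent of $\sigma$.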
Today's Content
• Logistic Regression
  – Discrete Output
  – Connection to MLE
• The Exponential Family
  – Bernoulli
  – Gaussian
• Generalized Linear Models (GLMs)
Sigmoid (Logistic) Function
$$y \in \{0, 1\}, \qquad h_\theta(x) \in [0, 1]$$

$$g(z) = \frac{1}{1 + e^{-z}}$$

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (as we will see next time with Generalized Linear Models), the choice of the logistic function is a natural one.
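A minimal sketch of the hypothesis in code (NumPy assumed; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), mapping R -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)
```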
Gradient Ascent for MLE of the Logistic Function
Recall:
Let's work with just one training example $(x, y)$ to derive the Gradient Ascent rule.
Given:
$$\theta := \theta + \alpha \nabla_\theta\, \ell(\theta)$$
One Useful Property of the Logistic Function
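The property the title refers to is presumably the standard derivative identity of the sigmoid, which the derivation below relies on:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = g(z)\bigl(1 - g(z)\bigr)$$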
Identical to Least Squares Again?
$$\frac{\partial}{\partial \theta_j}\, \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$
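A compact sketch of batch gradient ascent for logistic regression, implementing the update rule above; the step size, iteration count, and function name are illustrative choices:

```python
import numpy as np

def logistic_gradient_ascent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient ascent on the log-likelihood l(theta).

    X: (m, n) design matrix (include a column of ones for the intercept).
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
        # theta_j := theta_j + alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)
        theta += alpha * X.T @ (y - h)
    return theta
```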
Discriminative vs. Generative Algorithms
• Discriminative Learning
  – Either learn $p(y \mid x)$ directly, or learn a hypothesis $h_\theta$ that, given $x$, outputs a label in $\{0, 1\}$ directly;
  – Logistic regression is an example of a discriminative learning algorithm;
• In Contrast, Generative Learning
  – Build the probability distribution of $x$ conditioned on each of the classes, $p(x \mid y=1)$ and $p(x \mid y=0)$, respectively;
  – Also build the probability distributions $p(y=1)$ and $p(y=0)$, as the class priors (or the weights);
  – Use Bayes' rule to compare $p(x \mid y)$ for $y=1$ versus $y=0$, i.e., to see which one is more likely (as shown below);
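Concretely, the comparison in the last bullet uses Bayes' rule, with the denominator expanded by the law of total probability (a standard identity, not spelled out on the slide):

$$p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=1)\, p(y=1) + p(x \mid y=0)\, p(y=0)}$$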
Question
For $p(y \mid x; \theta)$:
• We learn $\theta$ in order to maximize $p(y \mid x; \theta)$
• When we do so:
  – If $y$ ~ Gaussian, we use Least-Squares Regression
  – If $y \in \{0,1\}$ ~ Bernoulli, we use Logistic Regression
Why? Any natural reasons?
Any Probabilistic, Linear, and General (PLG) Learning Framework?
A web-site visiting problem, as a candidate for a PLG solution
Generalized Linear Models: The Exponential Family
$$p(y; \eta) = b(y)\, \exp\!\left( \eta^T T(y) - a(\eta) \right)$$

where:
• $\eta$: the natural (distribution) parameter
• $T(y)$: the sufficient statistic, often $T(y) = y$
• $a(\eta)$: the normalization term (log-partition function)
1. A fixed choice of T, a, and b defines a set of distributions that is parameterized by $\eta$; as we vary $\eta$ we will get different distributions within this family (affecting the mean).
2. Bernoulli, Gaussian, and other distributions are examples of exponential family distributions.
3. A way of unifying various statistical models, like linear regression, logistic regression and Poisson regression, into one framework.
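As a concrete instance of point 2, the unit-variance Gaussian can be written in this template (a standard manipulation, sketched here rather than taken from the slide):

$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(y-\mu)^2}{2}\right) = \underbrace{\tfrac{1}{\sqrt{2\pi}} e^{-y^2/2}}_{b(y)}\, \exp\!\Bigl(\underbrace{\mu}_{\eta} \cdot \underbrace{y}_{T(y)} - \underbrace{\mu^2/2}_{a(\eta)}\Bigr)$$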
Examples of distributions in the exponential family
Bernoulli: $y \mid x; \theta$ ~ ExpFamily($\eta$); here we choose $a$, $b$, and $T$ in the specific form that makes the distribution Bernoulli.
For any fixed $x$ and $\theta$, we hope that our algorithm will output
$$h_\theta(x) = E[y \mid x; \theta] = p(y=1 \mid x; \theta) = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$$
If you recall that the logistic function has the form $1/(1+e^{-z})$, you should now understand why we choose the logistic form for a learning process when the data mimic a Bernoulli distribution.
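For completeness, a sketch of the algebra behind this choice (standard, not shown on the slide): writing the Bernoulli distribution in exponential-family form,

$$p(y; \phi) = \phi^{y} (1-\phi)^{1-y} = \exp\!\left( y \ln\frac{\phi}{1-\phi} + \ln(1-\phi) \right)$$

so $\eta = \ln\frac{\phi}{1-\phi}$, and solving for $\phi$ gives $\phi = \frac{1}{1+e^{-\eta}}$, the logistic function.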
To Build a GLM
1. $p(y \mid x; \theta)$, where $y$ follows a distribution in the Exponential Family($\eta$), given $x$ and $\theta$
2. Given $x$, our goal is to output $E[T(y) \mid x]$, i.e., we want $h(x) = E[T(y) \mid x]$ (note that in most cases, $T(y) = y$)
3. Think about the relationship between the input $x$ and the parameter $\eta$, which we hope to use to define the desired distribution, according to
$$\eta = \theta^T x \quad \text{(linear, as a design choice)}$$
where $\eta$ is a number or a vector.
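To make the recipe concrete, here is a hedged sketch of another member of the family, Poisson regression, built from the same three steps; the function name, step size, and iteration count are illustrative:

```python
import numpy as np

def poisson_regression(X, y, alpha=0.01, iterations=1000):
    """GLM with a Poisson response and the canonical log link.

    Step 1: y | x; theta ~ Poisson(mu)   (an exponential-family member)
    Step 2: h(x) = E[y | x] = mu
    Step 3: eta = theta^T x, with mu = exp(eta) (canonical link).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        mu = np.exp(X @ theta)   # predicted means; keep alpha small to avoid overflow
        # Gradient of the Poisson log-likelihood has the same familiar
        # form as before: sum_i (y^(i) - mu^(i)) x^(i).
        theta += alpha * X.T @ (y - mu)
    return theta
```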
Distribution           Link Name   Link Function                     Mean
Normal                 Identity    $\theta^T x = \mu$                $\mu = \theta^T x$
Exponential, Gamma     Inverse     $\theta^T x = \mu^{-1}$           $\mu = (\theta^T x)^{-1}$
Poisson                Log         $\theta^T x = \ln(\mu)$           $\mu = \exp(\theta^T x)$
Binomial, Multinomial  Logit       $\theta^T x = \ln(\mu/(1-\mu))$   $\mu = 1/(1+\exp(-\theta^T x))$
Generalized Linear Models: The Exponential Family

For the logit link:

$$\theta^T x = \ln\!\left( \frac{\mu}{1 - \mu} \right), \qquad \mu = \frac{1}{1 + \exp(-\theta^T x)}$$
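A small illustrative helper (not from the slides) collecting the canonical link / inverse-link pairs from the table above; the dictionary name and layout are assumptions:

```python
import numpy as np

# Canonical (link, inverse link) pairs: the link maps the mean mu to
# eta = theta^T x; the inverse link maps eta back to mu.
CANONICAL_LINKS = {
    "normal":   (lambda mu: mu,                      lambda eta: eta),
    "poisson":  (np.log,                             np.exp),
    "gamma":    (lambda mu: 1.0 / mu,                lambda eta: 1.0 / eta),
    "binomial": (lambda mu: np.log(mu / (1 - mu)),   lambda eta: 1.0 / (1 + np.exp(-eta))),
}
```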
More precisely…
A flexible generalization of ordinary least-squares regression that relates the random component (the distribution of the response) to the systematic component (the linear predictor) through a function called the link function.
Extensions
The standard GLM assumes that the observations are uncorrelated (i.i.d.). Models that deal with correlated data are extensions of GLMs.
• Generalized estimating equations: Use population-averaged effects.
• Generalized linear mixed models: A type of multilevel model (mixed model), an extension of logistic regression.
• Hierarchical generalized linear models: similar to generalized linear mixed models, apart from two distinctions:
  – The random effects can have any distribution in the exponential family, whereas current linear mixed models nearly always have normal random effects;
  – Computationally less complex than linear mixed models.
Summary
• GLM is a flexible generalization of ordinary least-squares regression.
• GLM generalizes linear regression by allowing the linear model to be related to the output variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
• GLMs are a way of unifying various other statistical models, including linear, logistic, …, and Poisson regressions, under one framework.
• This allowed us to develop a general algorithm for maximum likelihood estimation in all these models.
• It extends naturally to encompass many other models as well.
• In a GLM, the output is thus assumed to be generated from a particular distribution function of the exponential family.