Transcript
Review of Lecture Two
• Linear Regression
  – Cost Function
  – Gradient Descent
• Normal Equation
  – $(X^T X)^{-1}$
• Probabilistic Interpretation
  – Maximum Likelihood Estimation vs. Linear Regression
  – Gaussian Distribution of the Data
• Generative vs. Discriminative
General Linear Regression Methods: Important Implications
• Recall that $\theta$, a column vector (one entry for the intercept $\theta_0$ plus $n$ parameters), can be obtained from:
• When the X variables are linearly independent ($X^T X$ is full rank), there is a unique solution to the normal equations;
• The inversion of $X^T X$ rests on the idea of a generalized inverse, a matrix $G$ satisfying $XGX = X$, that is, on finding a matrix equivalent of a numerical reciprocal;
• Only models with a single output variable can be trained.
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

$$\theta = (X^T X)^{-1} X^T y$$
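As a concrete illustration, here is a minimal sketch (not from the slides) of solving the normal equations with NumPy; the toy data and variable names are assumptions for the example:

```python
import numpy as np

# Toy design matrix: first column of ones for the intercept theta_0.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
y = np.array([1.0, 2.1, 3.9])

# theta = (X^T X)^{-1} X^T y. np.linalg.solve is preferred over an
# explicit inverse for numerical stability, and it fails (as the
# slide notes) when X^T X is not full rank.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [intercept, slope]
```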
Maximum Likelihood Estimation
• Assume the data are i.i.d. (independently and identically distributed)
  – The likelihood $L(\theta)$ = the probability of $y$ given $x$, parameterized by $\theta$
• What is Maximum Likelihood Estimation (MLE)?
  – Choose parameters $\theta$ to maximize the function $L(\theta)$, so as to make the training data set as probable as possible.
$$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

(the factorization into a product follows from the i.i.d. assumption above)
The Connection Between MLE and OLE (Ordinary Least-squares Estimation)
• Choose parameters $\theta$ to maximize the data likelihood:
$$L(\theta) = L(\theta; X, y)$$

• Equivalent to minimizing:
The Equivalence of MLE and OLE
$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta)\;\; !?$$
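A sketch of the step behind this equivalence, assuming $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with Gaussian noise $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Hence maximizing $\ell(\theta)$ is the same as minimizing $J(\theta)$, independent of $\sigma$.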
Today's Content
• Logistic Regression
  – Discrete Output
  – Connection to MLE
• The Exponential Family
  – Bernoulli
  – Gaussian
• Generalized Linear Models (GLMs)
Sigmoid (Logistic) Function
$$y \in \{0, 1\}, \qquad h_\theta(x) \in [0, 1]$$

$$g(z) = \frac{1}{1 + e^{-z}}$$

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (as we will see next time with Generalized Linear Models), the choice of the logistic function is a natural one.
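A minimal sketch of the hypothesis in code (NumPy assumed; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), mapping R -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)
```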
Gradient Ascent for MLE of the Logistic Function
Recall:
Let's work with just one training example $(x, y)$ to derive the Gradient Ascent rule.
Given:
$$\theta := \theta + \alpha \nabla_\theta\, \ell(\theta)$$
One Useful Property of the Logistic Function
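The property the title refers to is presumably the standard derivative identity of the sigmoid, which the derivation below relies on:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = g(z)\bigl(1 - g(z)\bigr)$$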
Identical to Least Squares Again?
$$\frac{\partial}{\partial \theta_j}\, \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$
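A compact sketch of batch gradient ascent for logistic regression, implementing the update rule above; the step size, iteration count, and function name are illustrative choices:

```python
import numpy as np

def logistic_gradient_ascent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient ascent on the log-likelihood l(theta).

    X: (m, n) design matrix (include a column of ones for the intercept).
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
        # theta_j := theta_j + alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)
        theta += alpha * X.T @ (y - h)
    return theta
```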
Discriminative vs. Generative Algorithms
• Discriminative Learning
  – Either learn $p(y \mid x)$ directly, or learn a hypothesis $h_\theta$ that, given $x$, outputs a label in $\{0, 1\}$ directly;
  – Logistic regression is an example of a discriminative learning algorithm;
• In Contrast, Generative Learning
  – Build the probability distribution of $x$ conditioned on each of the classes, $p(x \mid y=1)$ and $p(x \mid y=0)$, respectively;
  – Also build the probability distributions $p(y=1)$ and $p(y=0)$, as the class priors (or the weights);
  – Use Bayes' rule to compare $p(x \mid y)$ for $y=1$ versus $y=0$, i.e., to see which one is more likely (as shown below);
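Concretely, the comparison in the last bullet uses Bayes' rule, with the denominator expanded by the law of total probability (a standard identity, not spelled out on the slide):

$$p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=1)\, p(y=1) + p(x \mid y=0)\, p(y=0)}$$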
Question
For $p(y \mid x; \theta)$:
• We learn $\theta$ in order to maximize $p(y \mid x; \theta)$
• When we do so:
  – If $y$ ~ Gaussian, we use Least-Squares Regression
  – If $y \in \{0,1\}$ ~ Bernoulli, we use Logistic Regression
Why? Any natural reasons?
Any Probabilistic, Linear, and General (PLG) Learning Framework?
A web-site visiting problem, as a candidate for a PLG solution
Generalized Linear Models: The Exponential Family
$$p(y; \eta) = b(y)\, \exp\!\left( \eta^T T(y) - a(\eta) \right)$$

where:
• $\eta$: the natural (distribution) parameter
• $T(y)$: the sufficient statistic, often $T(y) = y$
• $a(\eta)$: the normalization term (log-partition function)
1. A fixed choice of T, a, and b defines a set of distributions that is parameterized by $\eta$; as we vary $\eta$ we will get different distributions within this family (affecting the mean).
2. Bernoulli, Gaussian, and other distributions are examples of exponential family distributions.
3. A way of unifying various statistical models, like linear regression, logistic regression and Poisson regression, into one framework.
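As a concrete instance of point 2, the unit-variance Gaussian can be written in this template (a standard manipulation, sketched here rather than taken from the slide):

$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(y-\mu)^2}{2}\right) = \underbrace{\tfrac{1}{\sqrt{2\pi}} e^{-y^2/2}}_{b(y)}\, \exp\!\Bigl(\underbrace{\mu}_{\eta} \cdot \underbrace{y}_{T(y)} - \underbrace{\mu^2/2}_{a(\eta)}\Bigr)$$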
Examples of distributions in the exponential family
Bernoulli: $y \mid x; \theta$ ~ ExpFamily($\eta$); here we choose $a$, $b$, and $T$ in the specific form that makes the distribution Bernoulli.
For any fixed $x$ and $\theta$, we hope that our algorithm will output
$$h_\theta(x) = E[y \mid x; \theta] = p(y=1 \mid x; \theta) = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$$
If you recall that the logistic function has the form $1/(1+e^{-z})$, you should now understand why we choose the logistic form for a learning process when the data mimic a Bernoulli distribution.
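For completeness, a sketch of the algebra behind this choice (standard, not shown on the slide): writing the Bernoulli distribution in exponential-family form,

$$p(y; \phi) = \phi^{y} (1-\phi)^{1-y} = \exp\!\left( y \ln\frac{\phi}{1-\phi} + \ln(1-\phi) \right)$$

so $\eta = \ln\frac{\phi}{1-\phi}$, and solving for $\phi$ gives $\phi = \frac{1}{1+e^{-\eta}}$, the logistic function.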
To Build a GLM
1. $p(y \mid x; \theta)$, where $y$ follows a distribution in the Exponential Family($\eta$), given $x$ and $\theta$
2. Given $x$, our goal is to output $E[T(y) \mid x]$, i.e., we want $h(x) = E[T(y) \mid x]$ (note that in most cases, $T(y) = y$)
3. Think about the relationship between the input $x$ and the parameter $\eta$, which we hope to use to define the desired distribution, according to
$$\eta = \theta^T x \quad \text{(linear, as a design choice)}$$
where $\eta$ is a number or a vector.
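To make the recipe concrete, here is a hedged sketch of another member of the family, Poisson regression, built from the same three steps; the function name, step size, and iteration count are illustrative:

```python
import numpy as np

def poisson_regression(X, y, alpha=0.01, iterations=1000):
    """GLM with a Poisson response and the canonical log link.

    Step 1: y | x; theta ~ Poisson(mu)   (an exponential-family member)
    Step 2: h(x) = E[y | x] = mu
    Step 3: eta = theta^T x, with mu = exp(eta) (canonical link).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        mu = np.exp(X @ theta)   # predicted means; keep alpha small to avoid overflow
        # Gradient of the Poisson log-likelihood has the same familiar
        # form as before: sum_i (y^(i) - mu^(i)) x^(i).
        theta += alpha * X.T @ (y - mu)
    return theta
```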
Distribution           Link Name   Link Function                     Mean
Normal                 Identity    $\theta^T x = \mu$                $\mu = \theta^T x$
Exponential, Gamma     Inverse     $\theta^T x = \mu^{-1}$           $\mu = (\theta^T x)^{-1}$
Poisson                Log         $\theta^T x = \ln(\mu)$           $\mu = \exp(\theta^T x)$
Binomial, Multinomial  Logit       $\theta^T x = \ln(\mu/(1-\mu))$   $\mu = 1/(1+\exp(-\theta^T x))$
Generalized Linear Models: The Exponential Family

For the logit link:

$$\theta^T x = \ln\!\left( \frac{\mu}{1 - \mu} \right), \qquad \mu = \frac{1}{1 + \exp(-\theta^T x)}$$
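A small illustrative helper (not from the slides) collecting the canonical link / inverse-link pairs from the table above; the dictionary name and layout are assumptions:

```python
import numpy as np

# Canonical (link, inverse link) pairs: the link maps the mean mu to
# eta = theta^T x; the inverse link maps eta back to mu.
CANONICAL_LINKS = {
    "normal":   (lambda mu: mu,                      lambda eta: eta),
    "poisson":  (np.log,                             np.exp),
    "gamma":    (lambda mu: 1.0 / mu,                lambda eta: 1.0 / eta),
    "binomial": (lambda mu: np.log(mu / (1 - mu)),   lambda eta: 1.0 / (1 + np.exp(-eta))),
}
```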
More precisely…
A flexible generalization of ordinary least-squares regression that relates the random component (the distribution of the response) to the systematic component (the linear predictor) through a function called the link function.
Extensions
The standard GLM assumes that the observations are uncorrelated (i.i.d.). Models that deal with correlated data are extensions of GLMs.
• Generalized estimating equations: Use population-averaged effects.
• Generalized linear mixed models: A type of multilevel model (mixed model), an extension of logistic regression.
• Hierarchical generalized linear models: similar to generalized linear mixed models, apart from two distinctions:
  – The random effects can have any distribution in the exponential family, whereas current linear mixed models nearly always have normal random effects;
  – Computationally less complex than linear mixed models.
Summary
• GLM is a flexible generalization of ordinary least-squares regression.
• GLM generalizes linear regression by allowing the linear model to be related to the output variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
• GLMs are a way of unifying various other statistical models, including linear, logistic, …, and Poisson regressions, under one framework.
• This allowed us to develop a general algorithm for maximum likelihood estimation in all these models.
• It extends naturally to encompass many other models as well.
• In a GLM, the output is thus assumed to be generated from a particular distribution function of the exponential family.