Logistic Regression
Lyle Ungar
Learning objectives
• Logistic model & loss
• Decision boundaries as hyperplanes
• Multi-class regression
What do you do with a binary y?
• Can you use linear regression?
  - y = w^T x
• How about a different link function?
  - y = f(w^T x)
• Or a different probability distribution?
  - P(y = 1 | x) = f(w^T x)
Logistic function
h_θ(x) = 1 / (1 + exp(−θ^T x))
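A minimal sketch of the logistic function in NumPy (the function names here are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Model output h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(theta @ x)
```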
Logistic Regression
• Log odds are linear in x:  log[ p(y = 1 | x) / p(y = −1 | x) ] = θ^T x
• Log likelihood of data (for y = 1 or −1):  ℓ(θ) = −Σ_i log(1 + exp(−y_i θ^T x_i))
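As a sketch, the ±1-label log likelihood above could be computed like this (NumPy assumed; `np.log1p` is a standard numerical-stability choice, not from the slides):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Log likelihood for labels y in {-1, +1}:
    ell(theta) = sum_i -log(1 + exp(-y_i * theta^T x_i))."""
    margins = y * (X @ theta)            # y_i * theta^T x_i for each example
    return -np.sum(np.log1p(np.exp(-margins)))
```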
Decision Boundary

Representing Hyperplanes
• How do we represent a line?
• In general, a hyperplane is defined by 0 = w^T x
[Figure: the red vector w (e.g., w = [1, −1]) defines the green hyperplane that is orthogonal to it.]

Why bother with this weird representation?
• Projections: w^T x is the (scaled) projection of x onto w
• Now classification is easy!  h(x) = sgn(w^T x)
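A small illustration of sign-based classification using the figure's example vector w = [1, −1] (NumPy assumed; names are illustrative):

```python
import numpy as np

# The hyperplane 0 = w^T x splits the space in two;
# the sign of the projection w^T x gives the class.
w = np.array([1.0, -1.0])                  # the example vector from the figure

def classify(w, x):
    """h(x) = sgn(w^T x): +1 on one side of the hyperplane, -1 on the other."""
    return np.sign(w @ x)

print(classify(w, np.array([2.0, 0.5])))   # +1: x1 > x2, above the boundary
print(classify(w, np.array([0.5, 2.0])))   # -1: x1 < x2, below the boundary
```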
Computing MLE
• Use gradient ascent
  - Loss function = log-likelihood

Computing MAP
• Prior: p(θ) ∝ exp(−‖θ‖² / (2σ²))  (Gaussian)
• So solve:  θ̂ = argmax_θ  ℓ(θ) − ‖θ‖² / (2σ²)
• Again use gradient ascent (equivalently, gradient descent on the negated objective)
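One possible gradient-ascent sketch covering both objectives (NumPy assumed; the learning rate, step count, and function names are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_steps=1000, sigma2=None):
    """Gradient ascent on the log likelihood (MLE); if sigma2 is given,
    on the log posterior with a Gaussian prior (MAP). Labels y in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # d/dtheta of sum_i -log(1 + exp(-y_i theta^T x_i))
        grad = X.T @ (y * sigmoid(-y * (X @ theta)))
        if sigma2 is not None:
            grad -= theta / sigma2     # gradient of log prior -||theta||^2 / (2 sigma^2)
        theta += lr * grad
    return theta
```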
Multi-Class Classification
• Disease diagnosis: healthy / cold / flu / pneumonia
• Object classification: desk / chair / monitor / bookcase
[Figure: binary classification (two classes in the x1–x2 plane) vs. multi-class classification (several classes).]
Multi-Class Logistic Regression
• For 2 classes:
  h_θ(x) = 1 / (1 + exp(−θ^T x)) = exp(θ^T x) / (1 + exp(θ^T x))
  - In the denominator 1 + exp(θ^T x), the 1 is the weight assigned to y = −1 and exp(θ^T x) is the weight assigned to y = 1
• For K classes:
  p(y = k | x; θ_1, ..., θ_K) = exp(θ_k^T x) / Σ_{j=1}^{K} exp(θ_j^T x)
  - Called the softmax function: maps a vector to a probability distribution
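A short softmax sketch (NumPy assumed; the max-shift is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def softmax(scores):
    """Maps a vector of K scores to a probability distribution over K classes."""
    scores = scores - np.max(scores)   # stability shift; doesn't change the result
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def predict_proba(Theta, x):
    """Theta has one column theta_k per class;
    p[k] = exp(theta_k^T x) / sum_j exp(theta_j^T x)."""
    return softmax(Theta.T @ x)
```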
Multi-Class Logistic Regression
• Train a logistic regression classifier for each class k to predict the probability that y = k with
  h_k(x) = exp(θ_k^T x) / Σ_{j=1}^{K} exp(θ_j^T x)
[Figure: split into One vs. Rest — one decision boundary per class (θ_1, θ_2, θ_3) in the x1–x2 plane.]
Implementing Multi-Class Logistic Regression
• P(y = k | x) estimated by:
  h_k(x) = exp(θ_k^T x) / Σ_{j=1}^{K} exp(θ_j^T x)
• Gradient descent simultaneously updates all parameters for all models
  - Same derivative as before, just with the above h_k(x)
• Predict the class label as the most probable label:  ŷ = argmax_k h_k(x)
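A sketch of the training procedure described above, assuming NumPy, integer labels 0..K−1, and illustrative hyperparameters:

```python
import numpy as np

def fit_multiclass(X, y, K, lr=0.1, n_steps=1000):
    """Ascent on the multi-class log likelihood (equivalently, descent on
    its negation). X: (n, d) features; y: (n,) labels in {0, ..., K-1}."""
    n, d = X.shape
    Theta = np.zeros((d, K))                        # one weight vector theta_k per class
    Y = np.eye(K)[y]                                # one-hot targets, shape (n, K)
    for _ in range(n_steps):
        scores = X @ Theta                          # (n, K)
        scores -= scores.max(axis=1, keepdims=True) # stability shift
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)           # softmax rows: h_k(x_i)
        Theta += lr * X.T @ (Y - P)                 # simultaneous update of all parameters
    return Theta

def predict(Theta, X):
    """Most probable label: argmax_k h_k(x)."""
    return np.argmax(X @ Theta, axis=1)
```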
You should know
• Logistic model & loss
  - Linear in log-odds
• Decision boundaries
  - Hyperplane
• Softmax
  - Maps a vector to a probability distribution