Neural Networks
Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001
Administration
11/28 Neural Networks
Ch. 19 [19.3, 19.4]
12/03 Latent Semantic Indexing
12/05 Belief Networks
Ch. 15 [15.1, 15.2]
12/10 Belief Network Inference
Ch. 19 [19.6]
Proposal
11/28 Neural Networks
Ch. 19 [19.3, 19.4]
12/03 Backpropagation in NNs
12/05 Latent Semantic Indexing
12/10 Segmentation
Regression: Data
x_1 = 2    y_1 = 1
x_2 = 6    y_2 = 2.2
x_3 = 4    y_3 = 2
x_4 = 3    y_4 = 1.9
x_5 = 4    y_5 = 3.1
Given x, want to predict y.
Regression: Picture
[Scatter plot of the five (x, y) data points; x axis 0 to 8, y axis 0 to 3.5.]
Linear Regression
Linear regression assumes that the expected value of the output given an input, E(y|x), is linear.
Simplest case:
out(x) = w x
for some unknown weight w.
Estimate w given the data.
1-Parameter Linear Reg.
Assume that the data is formed by
y_i = w x_i + noise
where…
• the noise signals are independent
• noise normally distributed: mean 0 and unknown variance σ²
Distribution for ys
[Figure: the normal density of y, centered at wx, with wx ± σ and wx ± 2σ marked.]
Pr(y|w, x) normally distributed with mean wx and variance σ²
Data to Model
Fix xs. What w makes ys most likely? Also known as…
argmax_w Pr(y_1 … y_n | x_1 … x_n, w)
= argmax_w prod_i Pr(y_i | x_i, w)
= argmax_w prod_i exp(-1/2 ((y_i - w x_i) / σ)²)
= argmin_w sum_i (y_i - w x_i)²
Minimize sum-of-squared residuals.
Residuals
[The same scatter plot with the fitted line out(x) = w x; the vertical gaps between the points and the line are the residuals.]
How Minimize?
E = sum_i (y_i - w x_i)²
= sum_i y_i² - (2 sum_i x_i y_i) w + (sum_i x_i²) w²
Minimize quadratic function of w.
E minimized with
w* = (sum_i x_i y_i) / (sum_i x_i²)
so ML model is Out(x) = w* x.
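Not part of the original slides, but as a concrete check: a minimal Python/numpy sketch that applies w* = (sum_i x_i y_i) / (sum_i x_i²) to the five points from the Regression: Data slide.

```python
import numpy as np

# Data from the "Regression: Data" slide.
x = np.array([2.0, 6.0, 4.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood weight for out(x) = w x:
# w* = (sum_i x_i y_i) / (sum_i x_i^2)
w_star = np.dot(x, y) / np.dot(x, x)

print("w* =", w_star)            # fitted slope
print("Out(5) =", w_star * 5.0)  # prediction for a new input x = 5
```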
Multivariate Regression
What if inputs are vectors?
n data points, D components each.
X = [ x_1 ; … ; x_n ]   (n rows, each a length-D input vector)
Y = [ y_1 ; … ; y_n ]   (the n outputs)
Closed Form Solution
Multivariate linear regression assumes a vector w s.t.
Out(x) = w^T x
= w[1] x[1] + … + w[D] x[D]
ML solution: w = (X^T X)^(-1) (X^T Y)
X^T X is D x D; its (k, j) element is sum_i x_ij x_ik
X^T Y is D x 1; its k-th element is sum_i x_ik y_i
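A hedged Python/numpy sketch of the closed-form fit above; the helper name fit_linear and the toy data are illustrative, and solving the normal equations with np.linalg.solve rather than forming the inverse explicitly is a standard numerical choice, not something the slides specify.

```python
import numpy as np

def fit_linear(X, Y):
    """ML weights for Out(x) = w^T x, i.e. w = (X^T X)^-1 (X^T Y).
    X is n-by-D (one row per data point), Y has length n."""
    # Solving the normal equations avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy example: n = 4 points, D = 2 components; Y was built with w = [1, 2].
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [0.0, 4.0]])
Y = np.array([5.0, 2.0, 5.0, 8.0])
print(fit_linear(X, Y))  # approximately [1. 2.]
```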
Got Constants?
[Scatter plot of (x, y) data; x axis 0 to 8, y axis 0 to 10. The points do not line up with a line through the origin.]
Fitting with an Offset
We might expect a linear function that doesn’t go through the origin.
Simple obvious hack so we don’t have to start from scratch…
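The hack the slides have in mind appears later as step 2 of the “Batch” Algorithm: append a constant 1 to every input, so one extra weight plays the role of the offset. A minimal sketch (assuming Python/numpy) applied to the earlier data:

```python
import numpy as np

x = np.array([2.0, 6.0, 4.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Append a column of 1s: Out(x) = w[0] * x + w[1], so the fit can miss the origin.
X = np.column_stack([x, np.ones_like(x)])
w = np.linalg.solve(X.T @ X, X.T @ y)
print("slope =", w[0], "offset =", w[1])
```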
Gradient Descent
Scalar function f(w): want a local minimum.
Start with some value for w.
Gradient descent rule:
w ← w - η ∂/∂w f(w)
η: “learning rate” (small pos. num.)
Justify!
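A tiny sketch of the rule in Python; the example function f(w) = (w - 3)², the learning rate, and the step count are illustrative assumptions, not from the slides:

```python
def gradient_descent(df, w, eta=0.1, steps=100):
    """Repeatedly apply the rule w <- w - eta * df(w)."""
    for _ in range(steps):
        w = w - eta * df(w)
    return w

# Example: f(w) = (w - 3)^2, so df(w) = 2 (w - 3); the minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w=0.0))  # close to 3.0
```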
Partial Derivatives
E = sum_k (w^T x_k - y_k)² = f(w)
w_j ← w_j - η ∂/∂w_j f(w)
How would a small increase in weight w_j change the error?
Small positive? Large positive?
Small negative? Large negative?
Neural Net Connection
Set of weights w.
Find weights to minimize sum-of-squared residuals. Why?
When would we want to use gradient descent?
Linear Perceptron
Earliest, simplest NN.
[Diagram: inputs x_1, x_2, x_3, …, x_D plus a constant input 1, weighted by w_1, w_2, w_3, …, w_D and w_0, feed into a sum node that outputs y.]
Learning Rule
Multivariate linear function, trained by gradient descent.
Derive the update rule…
out(x) = w^T x
E = sum_k (w^T x_k - y_k)² = f(w)
w_j ← w_j - η ∂/∂w_j f(w)
“Batch” Algorithm
1. Randomly initialize w_1 … w_D
2. Append 1s to inputs to allow function to miss the origin
3. For i = 1 to n, δ_i = y_i - w^T x_i
4. For j = 1 to D, w_j = w_j + η sum_i δ_i x_ij
5. If sum_i δ_i² is small, stop; else go to 3.
Why squared?
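A minimal Python/numpy sketch of the five steps (the learning rate, tolerance, and epoch cap are illustrative assumptions, not values from the slides):

```python
import numpy as np

def train_regression_perceptron(X, y, eta=0.01, tol=1e-4, max_epochs=10000):
    # 2. Append 1s to the inputs so the function can miss the origin.
    X = np.column_stack([X, np.ones(len(X))])
    # 1. Randomly initialize the weights.
    w = 0.01 * np.random.randn(X.shape[1])
    for _ in range(max_epochs):
        # 3. delta_i = y_i - w^T x_i for every data point.
        delta = y - X @ w
        # 4. w_j = w_j + eta * sum_i delta_i x_ij (all j at once).
        w = w + eta * (X.T @ delta)
        # 5. Stop once the summed squared deltas are small
        #    (for noisy data this rarely triggers, so the epoch cap stops us).
        if np.sum(delta ** 2) < tol:
            break
    return w

x = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])
print(train_regression_perceptron(x, y))  # [slope, offset]
```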
Classification
Let’s say all outputs are 0 or 1.
How can we interpret the output of the perceptron as zero or one?
Classification
[Plot of 0/1 outputs against x (0 to 10); the vertical axis runs from -0.4 to 1.]
Change Output Function
Solution:
Instead of out(x) = w^T x
we’ll use
out(x) = g(w^T x)
g: ℝ → (0, 1), a squashing function
Sigmoid
E = sum_k (g(w^T x_k) - y_k)² = f(w)
where g(h) = 1 / (1 + e^(-h))
[Plot of the sigmoid squashing function g, rising from 0 to 1.]
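A quick numerical check of the sigmoid and of the identity g'(h) = g(h)(1 - g(h)) used two slides later (Python/numpy assumed; not a substitute for the analytic argument asked for in Homework 9):

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric_derivative = (g(h + eps) - g(h - eps)) / (2 * eps)
# Largest gap between the finite-difference derivative and g(h) (1 - g(h)):
print(np.max(np.abs(numeric_derivative - g(h) * (1 - g(h)))))  # essentially zero
```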
Classification Percept.
[Diagram: inputs x_1, x_2, x_3, …, x_D plus a constant input 1, weighted by w_1, w_2, w_3, …, w_D and w_0, feed into a sum node producing net_i, which is squashed by g to give output y.]
Classifying Regions
[Diagram: points in the (x_1, x_2) plane labeled 1 and 0; the perceptron carves the plane into regions for each class.]
Gradient Descent in Perceptrons
Notice g'(h) = g(h) (1 - g(h)).
Let net_i = sum_k w_k x_ik,  δ_i = y_i - g(net_i)
out(x_i) = g(net_i)
E = sum_i (y_i - g(net_i))²
∂E/∂w_j = sum_i 2 (y_i - g(net_i)) (-∂/∂w_j g(net_i))
= -2 sum_i (y_i - g(net_i)) g'(net_i) ∂/∂w_j net_i
= -2 sum_i δ_i g(net_i) (1 - g(net_i)) x_ij
Delta Rule for Perceptrons
w_j = w_j + η sum_i δ_i out(x_i) (1 - out(x_i)) x_ij
Invented and popularized by Rosenblatt (1962)
Guaranteed convergence
Stable behavior for overconstrained and underconstrained problems
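A hedged Python/numpy sketch of a classification perceptron trained with this delta rule; the toy data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_classification_perceptron(X, y, eta=0.5, epochs=5000):
    X = np.column_stack([X, np.ones(len(X))])  # append 1s for the offset weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        out = g(X @ w)                          # out(x_i) = g(w^T x_i)
        delta = y - out
        # Delta rule: w_j += eta * sum_i delta_i out(x_i) (1 - out(x_i)) x_ij
        w = w + eta * (X.T @ (delta * out * (1 - out)))
    return w

# Toy 1-D problem: output 0 for negative inputs, 1 for positive inputs.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = train_classification_perceptron(X, y)
outs = g(np.column_stack([X, np.ones(len(X))]) @ w)
print(np.round(outs, 2))  # near 0 for the negative inputs, near 1 for the positive ones
```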
What to Learn
Linear regression as ML
Gradient descent to find ML
Perceptron training rule (regression version and classification version)
Sigmoids for classification problems
Homework 9 (due 12/5)
1. Write a program that decides if a pair of words are synonyms using WordNet. I’ll send you the list, you send me the answers.
2. Draw a decision tree that represents (a) f_1 + f_2 + … + f_n (or), (b) f_1 f_2 … f_n (and), (c) parity (odd number of features “on”).
3. Show that g'(h) = g(h)(1 - g(h)).