Introduction to Neural Networks
Fundamental ideas behind artificial neural networks
Haidar Khan, Bulent Yener
RPI - Computer Science - Haidar Khan
Outline
Introduction
Machine learning framework
Neural networks
1. Simple linear models
2. Nonlinear activations
3. Gradient descent
Demos
What are (artificial) neural networks
A technique for estimating patterns from data (~1940s)
Also called "multi-layer perceptrons"
"Neural" - a very crude mimicry of how real biological neurons work
A large network of simple units that produces a complex output
Why do we care about them
A key ingredient in real AI
Useful for industry problems
Perform best on many important tasks
Yield insights into the biological brain (maybe)
General machine learning framework
Data - an N × d matrix X; its N rows are the observations x_i (each 1 × d)
Data labels - an N × 1 vector y
Assume there is some unknown function f(·) that generates the label y_i given x_i: f(x_i) = y_i
ML problem: estimate f(·), then use it to generate labels for new observations!
Some examples...
Problem: 119 images of cats and dogs (20 × 20 pixels)
Data: 119 × 400 matrix of pixel data (we stretch each image into a long vector)
Data labels: {Cat, Dog}

Problem: a 15-question political poll of 139 residents on recent state legislation
Data: 139 × 15 matrix of answers (A-E)
Data labels: party affiliation: {Republican, Democrat, Independent}
Recall: Linear regression
Assume the generating function f(·) is linear
Write the label y_i as a linear function of x_i: y_i = x_i w
Matrix form: y = Xw
What should the d × 1 vector w be? This is the familiar least squares regression:
w = (X^T X)^{-1} X^T y
We will set up the simplest neural network and show that we arrive at this same solution!
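The closed-form solution above can be checked in a few lines of NumPy. The data and generating weights below are made up for illustration:

```python
import numpy as np

# Toy data: N = 6 observations, d = 2 features (values are illustrative).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
true_w = np.array([2.0, -1.0])   # hypothetical generating weights
y = X @ true_w                   # labels produced by the linear function f

# Closed-form least squares: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # recovers true_w up to floating-point error
```

In practice `np.linalg.lstsq` is preferred over forming the inverse explicitly, but the explicit form mirrors the formula on the slide.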
Declare a simple neural network
Recall that x is 1 × d
One artificial neural unit:
Connects to each input x_i with a weight w_i
Produces one output z
z = Σ_i x_i w_i
(image: http://nikhilbuduma.com)
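The unit's output is just a dot product. A minimal sketch with made-up input and weight values:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # one observation (1 × d), values illustrative
w = np.array([1.0, 0.5, -0.5])   # one weight per input

z = float(np.dot(x, w))          # z = sum_i x_i * w_i
print(z)  # 0.5*1.0 + (-1.0)*0.5 + 2.0*(-0.5) = -1.0
```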
Set an objective to learn
Want the network outputs z_i to match the labels y_i
Choose a loss function E and optimize it w.r.t. the weights:
E = (1/2) Σ_i (z_i - y_i)^2
E = (1/2) Σ_i (x_i w - y_i)^2
How do we minimize E with respect to w?
Equivalence to least squares
Take the derivative and set it to zero:
dE/dw = Σ_i (x_i w - y_i) x_i^T
Σ_i (x_i^T x_i w - x_i^T y_i) = 0
Written in matrix form this becomes:
X^T X w - X^T y = 0
w = (X^T X)^{-1} X^T y
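We can verify numerically that the gradient vanishes at the least-squares solution. The random data below is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # toy data matrix
y = rng.normal(size=20)        # toy labels

w = np.linalg.inv(X.T @ X) @ X.T @ y   # least-squares solution
grad = X.T @ (X @ w - y)               # dE/dw in matrix form: X^T X w - X^T y
print(np.allclose(grad, 0))            # the gradient vanishes at the minimum
```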
Key idea: compose simple units
Where do we go from here? Use many of these simple units and compose them in layers
Function composition: g(h(·))
Each layer learns a new representation of the data
3-layer network: z_i = h_3(h_2(h_1(x_i)))
(image: http://neuralnetworksanddeeplearning.com)
Drawback to only linear units
Recall our earlier assumption that f(·) is linear - a very restrictive assumption
Furthermore, a composition of strictly linear models is itself linear!
z_i = h_3(h_2(h_1(x_i))) = W_3 W_2 W_1 x_i = W_123 x_i
XOR problem (Minsky and Papert, 1969)
XOR problem
Can't learn a simple XOR gate using only one straight line: no single linear boundary separates {(0,0), (1,1)} from {(0,1), (1,0)}
Key idea: non-linear activations
Solution: add a non-linear function at the output of each layer
What kind of function? It should at least be differentiable:
Hyperbolic tangent: z = tanh(w^T x)
Sigmoid: z = 1 / (1 + e^{-w^T x})
Rectified linear: z = max(0, w^T x)
Why? The labels y can be a non-linear function of the inputs (like XOR)
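The three activations above are one-liners in NumPy; a minimal sketch:

```python
import numpy as np

def tanh_act(a):
    # Hyperbolic tangent: squashes to (-1, 1)
    return np.tanh(a)

def sigmoid(a):
    # Sigmoid: squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # Rectified linear: zero for negative inputs, identity otherwise
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])   # a stands for the pre-activation w^T x
print(tanh_act(a))
print(sigmoid(a))  # sigmoid(0) = 0.5
print(relu(a))     # [0., 0., 2.]
```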
Examples of non-linear activations
http://ufldl.stanford.edu
How do we learn weights now?
With multiple layers and non-linear activation functions we can't simply take the derivative and set it to 0
We can still set a loss function and:
Randomly try different weights
Numerically estimate the derivative: f'(x) ≈ (f(x + h) - f(x)) / h
Both approaches are terribly inefficient and scale badly with the number of layers...
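The finite-difference estimate above takes one extra function evaluation per weight, which is what makes it so expensive at scale. As a sketch on a single scalar function:

```python
def numerical_derivative(f, x, h=1e-5):
    # Forward difference: f'(x) ≈ (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2          # toy function with known derivative f'(x) = 2x
print(numerical_derivative(f, 3.0))  # close to 6.0
```

For a network with millions of weights, repeating this per weight is hopeless; backpropagation (next) gets all gradients in one backward pass.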
Key idea: gradient descent on loss function
Suppose we could calculate the partial derivative of E w.r.t. each weight w_i: δE/δw_i (the gradient)
Decrease the loss function E by updating the weights against the gradient:
w_i = w_i - η (δE/δw_i), where η is a small step size
Repeating this process is called gradient descent
It leads to a set of weights corresponding to a local minimum of the loss function
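Applied to the squared-error loss from the earlier slides, gradient descent recovers the same weights as the closed-form solution. The data, step size, and iteration count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([1.5, -2.0])   # hypothetical generating weights
y = X @ true_w

w = np.zeros(2)                  # start from arbitrary weights
lr = 0.01                        # learning rate (step size eta)
for _ in range(2000):
    grad = X.T @ (X @ w - y)     # dE/dw for the squared-error loss
    w = w - lr * grad            # step against the gradient
print(w)  # converges toward true_w
```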
Backpropagation to estimate gradients
One of the breakthroughs in neural network research
Allows us to calculate the gradients of the network efficiently!
The core idea behind the algorithm is repeated application of the chain rule of derivatives:
F(x) = f(g(x))
F'(x) = f'(g(x)) g'(x)
Two passes through the network: forward and backward
Forward: calculate g(x) and then f(g(x))
Backward: calculate f'(g(x)) and then g'(x)
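The two-pass idea can be sketched on a single composition. The functions f and g below are arbitrary choices for illustration:

```python
import math

# F(x) = f(g(x)) with f(u) = sin(u), g(x) = x^2 (chosen only to illustrate)
def g(x): return x * x
def f(u): return math.sin(u)
def g_prime(x): return 2 * x
def f_prime(u): return math.cos(u)

x = 1.3
u = g(x)                          # forward pass: inner value first...
F = f(u)                          # ...then the outer function
dF_dx = f_prime(u) * g_prime(x)   # backward pass: F'(x) = f'(g(x)) g'(x)

# Sanity check against a finite difference
h = 1e-6
approx = (f(g(x + h)) - f(g(x))) / h
print(abs(dF_dx - approx) < 1e-4)  # the chain rule matches the numerical estimate
```

Note that the backward pass reuses the value u = g(x) saved during the forward pass; this caching is exactly why backpropagation is efficient.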
Multilayer Backpropagation
Assume we have t_i, t_j, z_j from the forward pass, where t_i is the output of neuron i in the layer below, z_j = Σ_i w_ij t_i is the input to neuron j, and t_j = f(z_j) is its output
Work backward from the output of the network:
E = (1/2) Σ_{j ∈ output} (t_j - y_j)^2
δE/δt_j = t_j - y_j (for output neurons)
δE/δz_j = (δt_j/δz_j)(δE/δt_j)
δE/δt_i = Σ_j (δz_j/δt_i)(δE/δz_j) = Σ_j w_ij (δE/δz_j)
δE/δw_ij = (δz_j/δw_ij)(δE/δz_j) = t_i (δE/δz_j)
(image: http://nikhilbuduma.com)
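The recursions above can be sketched for a tiny two-layer sigmoid network on one observation. All sizes and values are illustrative, and the sigmoid derivative δt/δz = t(1 - t) is used for the activation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # inputs t_i feeding the hidden layer
W1 = rng.normal(size=(3, 4))    # weights w_ij into the hidden layer
W2 = rng.normal(size=(4, 2))    # weights into the output layer
y = np.array([0.0, 1.0])        # target labels

# Forward pass: keep z (pre-activation) and t (post-activation) per layer
z1 = x @ W1;  t1 = sigmoid(z1)
z2 = t1 @ W2; t2 = sigmoid(z2)

# Backward pass, mirroring the slide's recursions
dE_dt2 = t2 - y                     # output neurons: dE/dt_j = t_j - y_j
dE_dz2 = t2 * (1 - t2) * dE_dt2     # dt_j/dz_j = t_j(1 - t_j) for a sigmoid
dE_dW2 = np.outer(t1, dE_dz2)       # dE/dw_ij = t_i * dE/dz_j
dE_dt1 = W2 @ dE_dz2                # dE/dt_i = sum_j w_ij * dE/dz_j
dE_dz1 = t1 * (1 - t1) * dE_dt1
dE_dW1 = np.outer(x, dE_dz1)
```

A finite-difference check on any single weight confirms these gradients, which is the standard way to debug a hand-written backward pass.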
Putting all the pieces together
3 key elements to understanding neural networks:
1. Composition of units with simple operations (dot products)
2. Non-linear activation functions at unit outputs
3. Weights learned by gradient descent
Using neural networks:
1. Set up the data matrix and label vector: X and y
2. Define a network architecture: number of layers, units per layer
3. Choose a loss function to minimize: depends on the task
A couple of demos...
Credits
Images from:
http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
http://ufldl.stanford.edu
http://neuralnetworksanddeeplearning.com