Page 1:

Introduction to Neural Networks

Fundamental ideas behind artificial neural networks

Haidar Khan, Bulent Yener
RPI – Computer Science

Page 2:

Outline

Introduction
Machine learning framework
Neural networks
1. Simple linear models
2. Non-linear activations
3. Gradient descent
Demos

Page 3:

What are (artificial) neural networks?

A technique to estimate patterns from data (~1940s)

Also called "multi-layer perceptrons"
"Neural": a very crude mimicry of how real biological neurons work
A large network of simple units that produces a complex output

Page 4:

Why do we care about them?

Key ingredient in real AI
Useful for industry problems
Perform best on important tasks
Yield insights into the biological brain (maybe)

Page 5:

General machine learning framework

Data: an $n \times m$ matrix $X$; rows are observations $\mathbf{x}_i$ ($1 \times m$)
Data labels: an $n \times 1$ vector $\mathbf{y}$
Assume there is some unknown function $f(\cdot)$ that generates the label $y_i$ given $\mathbf{x}_i$:
$f(\mathbf{x}_i) = y_i$
ML problem: estimate $f(\cdot)$, then use it to generate labels for new observations!

๐‘‹๐‘‹ ๐’š๐’š

๐‘›๐‘›

๐‘š๐‘š

Page 6:

Some examples…

Problem: 119 images of cats and dogs (20 × 20 pixels)
Data: 119 × 400 matrix of pixel data (each image stretched into a long vector)
Data labels: {Cat, Dog}

Problem: a 15-question political poll of 139 residents on recent state legislation
Data: 139 × 15 matrix of answers (A–E)
Data labels: party affiliation: {Republican, Democrat, Independent}

Page 7:

Recall: Linear regression

Assume the generating function $f(\cdot)$ is linear
Write the label $y_i$ as a linear function of $\mathbf{x}_i$: $y_i = \mathbf{x}_i \mathbf{w}$
Matrix form: $\mathbf{y} = X\mathbf{w}$

What should the $m \times 1$ vector $\mathbf{w}$ be? This is the familiar least-squares regression:
$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$
We will set up the simplest neural network and show we arrive at this same solution!
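
As a quick illustration, here is a minimal NumPy sketch of the closed-form solution; the data is synthetic and the variable names are our own, not the slides':

```python
import numpy as np

# Synthetic data: n observations of m features, labels from a known linear f
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Closed-form least squares: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the inverse explicitly)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers w_true up to numerical error
```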

Page 8:

Declare a simple neural network

Recall $\mathbf{x}$ is $1 \times m$
One artificial neural unit:
connects to each input $x_i$ with a weight $w_i$
produces one output $z$:
$z = \sum_{i=1}^{m} x_i w_i$

[Figure: a single unit weighting its inputs; image from http://nikhilbuduma.com]
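
In code, the unit is just a dot product (a minimal sketch; the numbers are arbitrary):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # one observation (1 x m)
w = np.array([0.1, 0.4, -0.2])   # one weight per input
z = x @ w                        # the unit's single output: sum_i x_i * w_i
print(z)
```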

Page 9:

Set an objective to learn

Want the network outputs $z_i$ to match the labels $y_i$
Choose a loss function $E$ and optimize it w.r.t. the weights:
$E = \frac{1}{2} \sum_{i=1}^{N} (z_i - y_i)^2$
Substituting $z_i = \mathbf{x}_i \mathbf{w}$:
$E = \frac{1}{2} \sum_{i=1}^{N} (\mathbf{x}_i \mathbf{w} - y_i)^2$
How do we minimize $E$ with respect to $\mathbf{w}$?
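
Continuing the earlier NumPy sketch (same X, y, and w), the loss is one vectorized line:

```python
# Squared-error loss E = 1/2 * sum_i (x_i w - y_i)^2, summed over all rows
E = 0.5 * np.sum((X @ w - y) ** 2)
print(E)  # ~0 here, since y was generated noise-free and w is the exact solution
```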

Page 10:

Equivalence to least squares

Take the derivative and set it to zero:
$\frac{dE}{d\mathbf{w}} = \sum_{i=1}^{N} (\mathbf{x}_i \mathbf{w} - y_i)\, \mathbf{x}_i^T$
$\sum_{i=1}^{N} \left( \mathbf{x}_i^T \mathbf{x}_i \mathbf{w} - \mathbf{x}_i^T y_i \right) = \mathbf{0}$

Written in matrix form this becomes:
$X^T X \mathbf{w} - X^T \mathbf{y} = \mathbf{0}$
$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$
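
A quick numerical check of this identity, reusing X, y, and the closed-form w from the earlier sketch:

```python
# The gradient X^T X w - X^T y should vanish at the least-squares solution
grad = X.T @ X @ w - X.T @ y
print(np.allclose(grad, 0.0))  # True
```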

Page 11:

Key idea: compose simple units

Where do we go from here?
Use many of these simple units and compose them in layers
Function composition: $g(h(\cdot))$
Each layer learns a new representation of the data
3-layer network: $z_i = h_3(h_2(h_1(\mathbf{x}_i)))$

http://neuralnetworksanddeeplearning.com

Page 12:

Drawback to only linear units

Recall our earlier assumption that $f(\cdot)$ is linear
This is a very restrictive assumption
Furthermore, a composition of strictly linear models is itself linear!
$z_i = h_3(h_2(h_1(\mathbf{x}_i))) = W_3 W_2 W_1 \mathbf{x}_i = W_{123} \mathbf{x}_i$
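
A minimal sketch of the collapse (shapes and values are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2, W3 = (rng.normal(size=(3, 3)) for _ in range(3))

z = W3 @ (W2 @ (W1 @ x))         # three stacked linear layers...
W123 = W3 @ W2 @ W1              # ...collapse into one linear map
print(np.allclose(z, W123 @ x))  # True
```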

XOR problem (Minsky, Papert 1969)

Page 13:

XOR problem

Can't learn a simple XOR gate using only one straight line
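
To see this numerically, here is a small sketch (our own illustration, not from the slides): the best least-squares fit of a single linear unit, even with a bias column, predicts 0.5 on every XOR input.

```python
import numpy as np

X_xor = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_xor = np.array([0., 1., 1., 0.])

Xb = np.hstack([X_xor, np.ones((4, 1))])        # append a bias column
w, *_ = np.linalg.lstsq(Xb, y_xor, rcond=None)  # best linear fit
print(Xb @ w)  # [0.5, 0.5, 0.5, 0.5] -- no single line separates XOR
```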

Page 14:

Key idea: non-linear activations

Solution: add a non-linear function at the output of each layer
What kind of function? Differentiable, at least:
Hyperbolic tangent: $z = \tanh(\mathbf{w}^T \mathbf{x}_i)$
Sigmoid: $z = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_i}}$
Rectified Linear: $z = \max(0, \mathbf{w}^T \mathbf{x}_i)$
Why? The labels $\mathbf{y}$ can be a non-linear function of the inputs (like XOR)
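
The three activations in NumPy, as a minimal sketch:

```python
import numpy as np

def tanh(a):     return np.tanh(a)
def sigmoid(a):  return 1.0 / (1.0 + np.exp(-a))
def relu(a):     return np.maximum(0.0, a)   # "rectified linear"
```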

Page 15:

Examples of non-linear activations

http://ufldl.stanford.edu

Page 16:

How do we learn weights now?

With multiple layers and non-linear activation functions we can't simply take the derivative and set it to 0

We can still set a loss function and:
Randomly try different weights
Numerically estimate the derivative:
$f'(x) \approx \frac{f(x + h) - f(x)}{h}$
Both are terribly inefficient and scale badly with the number of layers…
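
A sketch of the finite-difference estimate applied to our squared-error loss (reusing X and y from the earlier sketch; h is a small step we choose). Note that it needs one extra loss evaluation per weight, which is what scales so badly:

```python
def numerical_grad(loss, w, h=1e-6):
    grad = np.zeros_like(w)
    base = loss(w)
    for i in range(w.size):       # one perturbed evaluation per weight
        w_h = w.copy()
        w_h[i] += h
        grad[i] = (loss(w_h) - base) / h
    return grad

loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
print(numerical_grad(loss, np.zeros(3)))
```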

Page 17:

Key idea: gradient descent on loss function

Suppose we could calculate the partial derivative of $E$ w.r.t. each weight $w_i$:
$\frac{\partial E}{\partial w_i}$ (the gradient)

Decrease the loss function $E$ by stepping the weights against the gradient, for a small step size $\eta$:
$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$

Repeating this process is called gradient descent
It leads to a set of weights that correspond to a local minimum of the loss function
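
A minimal gradient-descent sketch on the same least-squares loss (eta and the step count are illustrative choices; the analytic gradient here is $X^T(X\mathbf{w} - \mathbf{y})$):

```python
eta = 0.001                # learning rate (step size)
w_gd = np.zeros(3)
for step in range(2000):
    grad = X.T @ (X @ w_gd - y)   # dE/dw for the squared-error loss
    w_gd -= eta * grad
print(w_gd)  # approaches the closed-form least-squares solution
```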

Page 18:

Backpropagation to estimate gradients

One of the breakthroughs in neural network research
Allows us to calculate the gradients of the network!
The core idea behind the algorithm is repeated application of the chain rule of derivatives:
$F(x) = f(g(x))$
$F'(x) = f'(g(x))\, g'(x)$

Two passes through the network: forward and backward
Forward: calculate $g(x)$ and then $f(g(x))$
Backward: calculate $f'(g(x))$ and then $g'(x)$
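
A tiny sketch of the two passes for $F(x) = f(g(x))$, with $f = \tanh$ and $g(x) = wx$ as our own toy choice:

```python
import numpy as np

w = 0.7
x = 2.0
# Forward pass: compute and cache the intermediate value g(x)
gx = w * x
Fx = np.tanh(gx)
# Backward pass: F'(x) = f'(g(x)) * g'(x), with tanh'(a) = 1 - tanh(a)^2
dF_dx = (1.0 - np.tanh(gx) ** 2) * w
print(Fx, dF_dx)
```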

Page 19:

Multilayer Backpropagation

Assume we have $t_i$, $t_j$, $z_j$ from the forward pass (activations $t$ and logits $z$, with $t_j = \sigma(z_j)$ for a sigmoid unit)
Work backward from the output of the network:
$E = \frac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2$, so $\frac{\partial E}{\partial t_j} = t_j - y_j$ (for output neurons)
$\frac{\partial E}{\partial z_j} = \frac{\partial t_j}{\partial z_j} \frac{\partial E}{\partial t_j} = t_j (1 - t_j) \frac{\partial E}{\partial t_j}$
$\frac{\partial E}{\partial t_i} = \sum_j \frac{\partial z_j}{\partial t_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = t_i \frac{\partial E}{\partial z_j}$

http://nikhilbuduma.com
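
A hedged sketch of these recurrences for one hidden sigmoid layer (layer shapes and function names are our own; $t$ = activations, $z$ = logits):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    z1 = x @ W1; t1 = sigmoid(z1)    # hidden-layer logits and activations
    z2 = t1 @ W2; t2 = sigmoid(z2)   # output-layer logits and activations
    return t1, t2

def backward(x, y, W2, t1, t2):
    dE_dz2 = (t2 - y) * t2 * (1 - t2)   # dE/dt_j, then through the sigmoid
    dE_dW2 = np.outer(t1, dE_dz2)       # dE/dw_ij = t_i * dE/dz_j
    dE_dt1 = W2 @ dE_dz2                # dE/dt_i = sum_j w_ij * dE/dz_j
    dE_dz1 = t1 * (1 - t1) * dE_dt1
    dE_dW1 = np.outer(x, dE_dz1)
    return dE_dW1, dE_dW2
```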

Page 20:

Putting all the pieces together

Three key elements to understanding neural networks:
1. Composition of units with simple operations (dot products)
2. Non-linear activation functions at unit outputs
3. Weights learned by gradient descent

Using neural networks (a full worked sketch follows below):
1. Set up the data matrix and label vector: $X$ and $\mathbf{y}$
2. Define a network architecture: number of layers, units per layer
3. Choose a loss function to minimize: depends on the task
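
Putting the recipe together on XOR, as a minimal end-to-end sketch (one hidden layer with biases; the layer sizes, learning rate, and step count are illustrative choices, and sigmoid networks can occasionally stall in a poor local minimum):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# 1. Data matrix X and label vector y (the XOR task)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# 2. Architecture: 2 inputs -> 4 hidden sigmoid units -> 1 sigmoid output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# 3. Squared-error loss, minimized by gradient descent via backpropagation
eta = 0.5
for step in range(20000):
    # Forward pass (cache activations for the backward pass)
    t1 = sigmoid(X @ W1 + b1)
    t2 = sigmoid(t1 @ W2 + b2)
    # Backward pass: the recurrences from the backpropagation slides
    d_z2 = (t2 - y) * t2 * (1 - t2)
    d_W2, d_b2 = t1.T @ d_z2, d_z2.sum(0)
    d_z1 = (d_z2 @ W2.T) * t1 * (1 - t1)
    d_W1, d_b1 = X.T @ d_z1, d_z1.sum(0)
    # Gradient-descent updates (in place)
    for p, g in ((W1, d_W1), (b1, d_b1), (W2, d_W2), (b2, d_b2)):
        p -= eta * g

print(t2.round(2).ravel())  # approaches [0, 1, 1, 0]
```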

Page 21:

A couple of demos…

Page 22:

Credits

Images from:
http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
http://ufldl.stanford.edu
http://neuralnetworksanddeeplearning.com