Page 1: Deep learning from a novice perspective

Deep learning from a novice perspective and recent innovations from KGPians

Anirban Santara, Doctoral Research Fellow

Department of CSE, IIT Kharagpur

bit.do/AnirbanSantara

Page 2: Deep learning from a novice perspective

Deep Learning: just a kind of Machine Learning

3 main tasks:

Classification

Regression

Clustering

Page 3: Deep learning from a novice perspective

CLASSIFICATION

[Figure: example images of Cats, Dogs and Pandas, plus an unlabelled query image]

Which class does a new image belong to?

Rather: estimate P(class | image)

Page 4: Deep learning from a novice perspective

REGRESSION

[Figure: dependent variable (target attribute) plotted against independent variable (feature)]

Page 5: Deep learning from a novice perspective

CLUSTERING

[Figure: data points plotted over Attribute 1 and Attribute 2, grouped into clusters]

Page 6: Deep learning from a novice perspective

The methodology:

1. Design a hypothesis function: h(y | x, θ)

   y: target attribute, x: input, θ: parameters of the learning machine

2. Keep improving the hypothesis until the predictions are good enough (a small illustrative sketch follows below)
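As a concrete illustration (not from the slides), a hypothesis function for simple linear regression can be written as a parameterized function; the variable names and numbers below are purely illustrative.

```python
import numpy as np

# A minimal sketch (not from the talk): a linear hypothesis y_hat = theta[0] + theta[1] * x.
# "theta" plays the role of the parameters of the learning machine.
def hypothesis(x, theta):
    """Predict the target attribute for input x under parameters theta."""
    return theta[0] + theta[1] * x

theta = np.array([0.5, 2.0])        # an initial guess for the parameters
x = np.array([1.0, 2.0, 3.0])       # inputs (features)
print(hypothesis(x, theta))         # predictions, to be improved by learning
```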

Page 7: Deep learning from a novice perspective

Well, how bad is your hypothesis? In the case of regression:

A very common measure is the mean squared error:

$E = \sum_{\text{all training examples}} \left\| y_{\text{desired}} - y_{\text{as per hypothesis}} \right\|^2$

In classification problems, the targets are often one-hot vectors such as $[1 \; 0]$ and $[0 \; 1]$.

In one-hot classification frameworks, we often use the mean squared error.

However, we often ask for the probabilities of occurrence of the different classes for a given input ( Pr(class|X) ). In that case we use the K-L divergence between the observed (p(output classes)) and predicted (q(output classes)) distributions as the measure of error. This is sometimes referred to as the cross-entropy error criterion.

$KL(p \,\|\, q) = \sum_{\text{classes}} p \, \ln \frac{p}{q}$

Clustering uses a plethora of criteria like:
• Entropy of a cluster
• Maximum distance between 2 neighbors in a cluster
-- and a lot more
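To make the regression and classification error measures above concrete, here is a small numpy sketch (not from the slides) computing a mean squared error and a cross-entropy between a one-hot target and predicted class probabilities; all values and variable names are illustrative.

```python
import numpy as np

# Mean squared error for regression: average of squared prediction errors
y_desired = np.array([1.0, 2.0, 3.0])
y_hypothesis = np.array([1.1, 1.9, 2.7])
mse = np.mean((y_desired - y_hypothesis) ** 2)

# Cross-entropy for one-hot classification: -sum_k p_k * log(q_k).
# With a one-hot p, the KL divergence differs from this only by the (zero) entropy of p.
p = np.array([0.0, 1.0])            # observed one-hot distribution, e.g. [0 1]
q = np.array([0.2, 0.8])            # predicted class probabilities, Pr(class | X)
cross_entropy = -np.sum(p * np.log(q + 1e-12))   # small epsilon avoids log(0)

print(mse, cross_entropy)
```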

Page 8: Deep learning from a novice perspective

Learning

Now it's time to rectify the machine and improve.

We perform "gradient descent" along the "error-plane" in the "parameter space":

$\Delta\,\text{parameter} = -\,\text{learning rate} \times \nabla_{\text{parameter}}\,\text{error function}$

$\text{parameter} \leftarrow \text{parameter} + \Delta\,\text{parameter}$

Page 9: Deep learning from a novice perspective

Let's now look into a practical learning system: the Artificial Neural Network.

[Figure: a feed-forward neural network classifying an image as Cat, Dog or Panda; each neuron is a very small unit of computation]

So the parameters of an ANN are:
1. Incoming weights of every neuron
2. Bias of every neuron

These are the ones that need to be tuned during learning.

We perform gradient descent on these parameters.

The backpropagation algorithm is a popular method of computing the required gradients.

Page 10: Deep learning from a novice perspective

Backpropagation algorithm

[Figure: a layered network fed with an input pattern vector, with weight matrices W21 and W32]

Forward propagation: compute the activations layer by layer from the input.

Error calculation: compare the network output with the desired output.

Backward propagation: propagate the error back through the network, using one delta rule if layer k is the output layer and a different one if layer k is a hidden layer (a worked sketch follows below).
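Below is a minimal numpy sketch (not the talk's own code) of backpropagation for a network with one hidden layer and sigmoid activations, showing how the delta for the output layer differs from the delta for a hidden layer; the layer sizes, target and learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))          # input pattern vector (4 features)
y_desired = np.array([[1.0]])        # desired output

# Randomly initialized parameters: weights and biases of the two layers
W21, b2 = rng.normal(size=(3, 4)) * 0.1, np.zeros((3, 1))   # input  -> hidden
W32, b3 = rng.normal(size=(1, 3)) * 0.1, np.zeros((1, 1))   # hidden -> output

lr = 0.5
for step in range(1000):
    # Forward propagation
    a2 = sigmoid(W21 @ x + b2)
    a3 = sigmoid(W32 @ a2 + b3)

    # Error calculation (squared error) and backward propagation
    delta3 = (a3 - y_desired) * a3 * (1 - a3)      # delta for the OUTPUT layer
    delta2 = (W32.T @ delta3) * a2 * (1 - a2)      # delta for a HIDDEN layer

    # Gradient-descent updates of all weights and biases
    W32 -= lr * delta3 @ a2.T;  b3 -= lr * delta3
    W21 -= lr * delta2 @ x.T;   b2 -= lr * delta2

print(a3)   # should move towards the desired output 1.0
```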

Page 11: Deep learning from a novice perspective

Well, after all, life is tough…

• The parameters of a neural network are generally initialized to random values.
• Starting from these random values (which carry no useful information), it is very difficult (well, not impossible, but very time-consuming) for backpropagation to arrive at the correct values of these parameters.
• Exponential activation functions like the sigmoid and hyperbolic tangent are traditionally used in artificial neurons. These functions have gradients that are prone to becoming zero in the course of backpropagation.
• If the gradients in a layer get close to zero, they induce the gradients in the previous layers to vanish too. As a result, the weights and biases in the lower layers remain immature.
• This phenomenon is called the "vanishing gradient" problem in the literature.

These problems crop up very frequently in neural networks that contain a large number of hidden layers and a very large number of parameters (the so-called Deep Neural Networks).

Page 12: Deep learning from a novice perspective

How to get around this? Ans: make an "informed" initialization.

• A signal is nothing but a set of random variables.
• These random variables jointly take values from a probability distribution that depends on the nature of the source of the signal.

E.g.: a blank 28x28 pixel array can house numerous kinds of images. The set of 784 random variables assumes values from a different joint probability distribution for every class of objects/scenes:

$P_{\text{digit}}(x_1, x_2, \ldots, x_{784})$

$P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$

Page 13: Deep learning from a novice perspective

Let's try and model the probability distribution of interest.

Our target distribution: $P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$. We try to capture this distribution in a model that looks quite similar to a single-layer neural network.

The Restricted Boltzmann Machine: it is a probabilistic graphical model (a special kind of Markov Random Field) that is capable of modelling a wide variety of probability distributions.

[Figure: an RBM whose hidden units capture the dependencies among the "visible" variables]

Page 14: Deep learning from a novice perspective

The working of the RBM

Parameters of the RBM:
1. Weights on the edges
2. Biases on each node (visible and hidden)

Using these we define a joint probability distribution over the "visible" variables and the "hidden" variables as given below, where the energy function $E(\boldsymbol{v}, \boldsymbol{h})$ is defined in terms of these weights and biases, and Z is a normalization term called the "partition function".

๐‘ƒ ๐‘…๐ต๐‘€ (๐’— ,๐’‰)= 1๐‘๐‘’๐ธ (๐’— ,๐’‰ )

๐‘ƒh๐‘ข๐‘š๐‘Ž๐‘› ๐‘“๐‘Ž๐‘๐‘’ (๐‘ฃ1 ,๐‘ฃ2,โ€ฆ,๐‘ฃ784 )

โˆ‘๐’‰

๐‘ƒ๐‘…๐ต๐‘€ (๐’— ,๐’‰ )

๐‘ƒ ๐‘…๐ต๐‘€ (๐‘ฃ1 ,๐‘ฃ2 ,โ€ฆ ,๐‘ฃ784 )

๐พ๐ฟยฟยฟโˆ’๐ป (๐‘ƒh๐‘ข๐‘š๐‘Ž๐‘› ๐‘“๐‘Ž๐‘๐‘’)โˆ’ โˆ‘

๐‘ฃ1 ,๐‘ฃ2 ,โ€ฆ,๐‘ฃ784

๐‘ƒh๐‘ข๐‘š๐‘Ž๐‘› ๐‘“๐‘Ž๐‘๐‘’ (๐‘ฃ1 ,โ€ฆ ,๐‘ฃ784 ) ๐‘™๐‘›๐‘ƒ ๐‘…๐ต๐‘€ (๐‘ฃ1 ,โ€ฆ,๐‘ฃ784 )

Empirical average of the log-likelihood of data under the model distributionNot under our control

MAXIMIZE
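As an illustrative sketch (not from the talk), the energy of a binary RBM and one step of the common contrastive-divergence (CD-1) approximation to this maximum-likelihood training can be written in numpy as follows; the energy form E(v, h) = −aᵀv − bᵀh − vᵀWh, the sizes, and the learning rate are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 128
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))  # weights on the edges
a = np.zeros(n_visible)                                  # visible biases
b = np.zeros(n_hidden)                                   # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h):
    # Assumed standard binary-RBM energy: E(v, h) = -a.v - b.h - v.W.h
    return -a @ v - b @ h - v @ W @ h

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence (CD-1) step on a single binary training vector v0."""
    # Up: sample hidden units given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # Down-up: reconstruct the visible units and recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # Approximate log-likelihood gradient: data term minus reconstruction term
    return lr * (np.outer(v0, ph0) - np.outer(pv1, ph1)), lr * (v0 - pv1), lr * (ph0 - ph1)

v0 = (rng.random(n_visible) < 0.5).astype(float)   # a dummy binary "image"
dW, da, db = cd1_update(v0)
W += dW; a += da; b += db
```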

Page 15: Deep learning from a novice perspective

Layer-wise pre-training using RBMs

• Every hidden layer is pre-trained as the hidden layer of an RBM.
• Because the RBM models the statistics of the input, the learnt weights and biases carry meaningful information about the input. Using these as initial values of the parameters of a deep neural network has shown phenomenal improvement over random initialization, both in terms of time complexity and performance.
• This is followed by fine-tuning of the entire network via back-propagation (a structural sketch follows below).
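Here is a structural sketch of this two-stage recipe, greedy layer-wise pre-training followed by supervised fine-tuning; `train_rbm` and `backprop_finetune` are hypothetical placeholders (not functions from the talk) standing in for CD-style RBM training and ordinary supervised backpropagation.

```python
import numpy as np

def activate(x, W, b):
    """Deterministic sigmoid activations of an RBM's hidden layer given data x."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

# Sketch only: train_rbm and backprop_finetune are hypothetical callables supplied by the user.
def pretrain_and_finetune(train_x, train_y, layer_sizes, train_rbm, backprop_finetune):
    weights, biases, data = [], [], train_x
    # Step 1: unsupervised, greedy layer-wise pre-training
    for n_hidden in layer_sizes:
        W, b = train_rbm(data, n_hidden)       # pre-train this layer as an RBM's hidden layer
        weights.append(W)
        biases.append(b)
        data = activate(data, W, b)            # its activations become the next RBM's "visible" data
    # Step 2: supervised fine-tuning of the whole stack via backpropagation
    return backprop_finetune(weights, biases, train_x, train_y)
```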

Page 16: Deep learning from a novice perspective

The Autoencoder

• An autoencoder is a neural network operating in unsupervised learning mode (a minimal code sketch follows below).
• The output is set equal to the input, so the network learns an (approximate) identity mapping from the input to the output.
• Applications:
  • Dimensionality reduction (efficient, non-linear)
  • Representation learning (discovering interesting structures)
  • Alternative to RBMs for layer-wise pre-training of deep neural networks

[Figure: a deep stacked autoencoder]
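Below is a minimal sketch of a stacked autoencoder in Keras, assuming TensorFlow/Keras is available (this is not code from the talk); the layer sizes, random data and training settings are illustrative only.

```python
import numpy as np
from tensorflow import keras

# Random toy data standing in for 28x28 images flattened to 784-dimensional vectors
x_train = np.random.rand(1000, 784).astype("float32")

# Encoder-decoder stack: the narrow middle layer is the learnt low-dimensional code
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="sigmoid"),
    keras.layers.Dense(32, activation="sigmoid"),    # bottleneck / representation
    keras.layers.Dense(128, activation="sigmoid"),
    keras.layers.Dense(784, activation="sigmoid"),
])

# Unsupervised mode: the target is the input itself
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64)

reconstructions = autoencoder.predict(x_train[:10])
```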

Page 17: Deep learning from a novice perspective

So deep learning ≈ training "deep" neural networks with many hidden layers.

Step 1: Unsupervised layer-wise pre-training
Step 2: Supervised fine-tuning

- This is pretty much all about how deep learning works. However, there is a class of deep networks called convolutional neural networks that often do not need pre-training, because these networks use extensive parameter sharing and rectified linear activation functions.

Well, deep learning when viewed from a different perspective looks really amazing!!!

Page 18: Deep learning from a novice perspective

Traditional machine learning vs. deep learning

Data → data representations by feature extractors → inference engine (classification, regression, clustering, efficient coding)

• Traditional machine learning: the representations come from hand-engineering of feature extractors.
• Deep learning: the representations come from data-driven, target-oriented representation learning.

Page 19: Deep learning from a novice perspective

What's so special about it?

Traditional machine learning:
• Designing feature detectors requires careful engineering and considerable domain expertise.
• Representations must be selective to aspects of the data that are important for our task and invariant to the irrelevant aspects (the selectivity-invariance dilemma).

Deep learning:
• Abstractions of hierarchically increasing complexity are learnt by a data-driven approach using general-purpose learning procedures.
• A composition of simple non-linear modules can learn very complex functions.
• Cost functions specific to the problem amplify aspects of the input that are important for the task and suppress irrelevant variations.

Page 20: Deep learning from a novice perspective

Pretty much how we humans go about analyzing…

Page 21: Deep learning from a novice perspective

Some deep architectures:

• Deep stacked autoencoder: used for efficient non-linear dimensionality reduction and for discovering salient underlying structures in data.
• Deep convolutional neural network: exploits the stationarity of natural data and uses the concept of parameter sharing to study large images and long spoken/written strings and make inferences from them.
• Recurrent neural network: custom-made for modelling dynamic systems; finds use in natural language (speech and text) processing, machine translation, etc.

Page 22: Deep learning from a novice perspective

Classical automatic speech recognition system

[Figure: block diagram of the recognition pipeline]

Signal acquisition → feature extraction → acoustic modelling → Viterbi beam search / A* decoding → N-best sentences or word lattice → rescoring → FINAL UTTERANCE

The decoder draws on phonetic utterance models (from acoustic model generation) and a sentence model (from sentence model preparation).

Page 23: Deep learning from a novice perspective

Some of our works:-

2015:

Deep neural network and Random Forest hybrid architecture for learning to detect retinal vessels in fundus images (accepted at EMBC-2015, Milan, Italy)

Our architecture:

Average accuracy of detection: 93.27%

2014-15:

Faster learning of deep stacked autoencoders on multi-core systems through synchronized layer-wise pre-training (accepted at PDCKDD Workshop, a part of ECML-PKDD 2015, Porto, Portugal)

[Figures: conventional serial pre-training vs. the proposed algorithm]

26% speedup for compression of MNIST handwritten digits

Page 24: Deep learning from a novice perspective

Take-home messages

• Deep learning is a set of algorithms that have been designed to
  1. train neural networks with a large number of hidden layers, and
  2. learn features of hierarchically increasing complexity in a data- and objective-driven manner.

• Deep neural networks are breaking records across AI because it can be proved that they can model highly non-linear functions of the data with fewer parameters than shallow networks.

• Deep learning is extremely interesting and a breeze to implement once the underlying philosophies are understood. It has great potential for use in a lot of ongoing projects at KGP.

If you are interested in going deeper into deep learning…

• Take Andrew Ng's Machine Learning course on Coursera.
• Visit ufldl.Stanford.edu and read the entire tutorial.
• Read LeCun's latest deep learning review published in Nature.

Page 25: Deep learning from a novice perspective

Thank you so much

Please give me some feedback for this talk by visiting bit.do/RateAnirban, or just scan the QR code.