CS489/698: Intro to ML - University of Waterloo (y328yu/mycourses/489/lectures/lec11_DNN.pdf)
Deriving Cost Functions - Regression
Hidden layers → output layer → conditional probability p(y | x)
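The derivation itself is not reproduced in this transcript; as a sketch, assuming the network output f(x) is the mean of a Gaussian over y, maximizing the likelihood is equivalent to minimizing the mean squared error:

$$ p(y \mid x) = \mathcal{N}\big(y;\, f(x), \sigma^2\big) \;\Rightarrow\; -\log p(y \mid x) = \frac{\big(y - f(x)\big)^2}{2\sigma^2} + \text{const} \;\Rightarrow\; L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 $$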
Mean Squared Error Properties
• Can have exploding gradients
• Many shallow valleys, tricky to optimize
• Trick: normalize outputs to be between 0 and 1 so the gradient cannot be greater than 1
• Not desired
Binary Classification
y = 0 or 1
Binary Crossentropy
• Derive with a Binomial (Bernoulli) likelihood instead of a Gaussian
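A sketch of the resulting loss, assuming a sigmoid output $\hat{p}_i = \sigma(z_i)$ interpreted as $p(y_i = 1 \mid x_i)$:

$$ L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\Big] $$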
Multiclass Classification
• Output layer activation function: softmax
• Loss function: categorical crossentropy
• Derived from the multinomial distribution
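A sketch of the softmax output and the categorical crossentropy loss, assuming logits $z \in \mathbb{R}^k$ and a one-hot label $y$:

$$ \operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{c=1}^{k} e^{z_c}}, \qquad L_{\text{CCE}} = -\sum_{j=1}^{k} y_j \log \operatorname{softmax}(z)_j $$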
Investigating Softmax
• Log softmax
• Numerically stable softmax
• No gradient saturation
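A minimal numpy sketch of the numerically stable softmax and log-softmax (the max-subtraction trick; function names are my own, not from the slides):

```python
import numpy as np

def stable_softmax(z):
    # Subtract the max so the largest exponent is 0; avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def log_softmax(z):
    # log softmax(z)_j = z_j - max(z) - log(sum(exp(z - max(z))))
    shifted = z - np.max(z, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
```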
Regularization
• Any modification made to the learning algorithm that is intended to reduce its generalization error but not its training error
• Take a high-capacity model, increase bias in a “good” direction, and reduce variance
Weight Penalties
• Cost function is of the form $\tilde{J}(\theta) = J(\theta) + \lambda\,\Omega(\theta)$, where $\Omega(\theta)$ is some penalty on the weights and $\lambda$ is a constant giving the penalty a weight
• Can also be applied to activations
L2 Penalty
• $\Omega(w) = \tfrac{1}{2}\lVert w \rVert_2^2$ (sum of squared weights)
• Favours many low weights
• It wants small changes in the input to have minimal effect on the output
• If w is a vector of size 4, it favours [0.25, 0.25, 0.25, 0.25] over [0, 0, 0, 1]
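A quick check of that example (my own arithmetic, not on the slide): both vectors sum to 1, but their L2 penalties differ by a factor of four.

$$ \lVert [0.25, 0.25, 0.25, 0.25] \rVert_2^2 = 4 \times 0.0625 = 0.25, \qquad \lVert [0, 0, 0, 1] \rVert_2^2 = 1 $$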
Data Augmentation
• Easy to implement
• Dramatically reduced generalization error on smaller datasets
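A minimal numpy sketch of two common image augmentations (horizontal flip and random crop); the function and parameter names are my own, not from the slides:

```python
import numpy as np

def augment(image, crop_size=28, rng=np.random.default_rng()):
    """Randomly flip and crop an H x W (x C) image array."""
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Random crop back to crop_size x crop_size.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]
```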
Injecting Noise in Outputs
• Many datasets can have wrong labels
• Maximizing log p(y | x) can be harmful when y is wrong
• Label smoothing: if a label has probability 1 − ε of being right, set the true class to 1 − ε and the false classes to ε / (k − 1)
• Dramatically reduces the cost of extremely incorrect examples
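A minimal sketch of label smoothing applied to one-hot targets (assuming k classes and a smoothing parameter ε; names are my own):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace hard 0/1 targets with 1 - eps (true class) and eps/(k-1) (false classes)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)
```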
Multi-Task Training
• Train a model to do multiple tasks from the same input (see the sketch below)
• Better generalization
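A minimal PyTorch-style sketch of the idea, a shared trunk with one head per task; all names and sizes are illustrative, not from the slides. The total loss would be a weighted sum of the per-task losses.

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk feeding two task-specific heads."""
    def __init__(self, in_dim=32, hidden=64, n_classes=10):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classify_head = nn.Linear(hidden, n_classes)  # task 1: classification
        self.regress_head = nn.Linear(hidden, 1)           # task 2: regression

    def forward(self, x):
        h = self.trunk(x)                  # shared representation
        return self.classify_head(h), self.regress_head(h)
```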
Early Stopping
• Run learning for a fixed number of iterations
• Every few iterations, check validation performance
• Save the model with the best validation performance
• Implicit hyperparameter search across time
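A minimal sketch of that loop; train_step, evaluate, and save_checkpoint are hypothetical callables supplied by the user, not part of the slides:

```python
def train_with_early_stopping(model, train_step, evaluate, save_checkpoint,
                              max_iters=10000, eval_every=100):
    """Keep the checkpoint with the best validation score seen during training."""
    best_val = float("-inf")
    for step in range(max_iters):          # fixed number of iterations
        train_step(model)                  # one optimization step
        if step % eval_every == 0:         # every few iterations
            val = evaluate(model)          # validation performance
            if val > best_val:             # save the best model so far
                best_val = val
                save_checkpoint(model)
    return best_val
```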
Ensembles
• Train multiple models and have them take a weighted vote
• Resample the dataset, re-initialize weights
• Multiple models will have some uncorrelated errors
• Free 2%
• Expensive
• Recycle models by checkpointing and increasing the LR
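A minimal sketch of the weighted-vote step by averaging predicted class probabilities; here each model is assumed to be a callable that returns a probability array (my own interface, not from the slides):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class probabilities of several models, then take the argmax."""
    probs = np.mean([predict_proba(x) for predict_proba in models], axis=0)
    return np.argmax(probs, axis=-1)
```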
Dropout - Train Time
• Multiply activations by a binary mask
• Dropout probability: the chance an activation will be turned off
• 0.5 for hidden layers
• 0.2 for input
Dropout - Test Time
• No binary mask
• Multiply all weights by the keep probability (1 − dropout probability)
• Approximately takes the average of all sub-networks
• Inverted dropout divides by the keep probability at train time instead
Dropout - Why??
• Simultaneously trains an ensemble of sub-networks
• Forces redundancy in the model
Dropout - Code
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
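The code itself is not reproduced in this transcript; below is a minimal numpy sketch in the spirit of the linked cs231n example. Here p is the probability of keeping a unit (the cs231n convention), the complement of the dropout probability defined above.

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_train(h):
    """Vanilla dropout at train time: drop each activation with probability 1 - p."""
    mask = np.random.rand(*h.shape) < p   # binary mask
    return h * mask

def dropout_test(h):
    """At test time: no mask, scale activations by the keep probability p."""
    return h * p
```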
Inverse Dropout - Code
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
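Again not reproduced in the transcript; a minimal numpy sketch of inverted dropout under the same keep-probability convention:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def inverted_dropout_train(h):
    """Inverted dropout: scale by 1/p at train time so test time needs no change."""
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def inverted_dropout_test(h):
    """Test time is just the identity."""
    return h
```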
Batch Normalization

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
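The batch-norm slides themselves only appear in the linked deck; a minimal numpy sketch of the train-time forward pass, where γ and β are the learned scale and shift:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale and shift
```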
Regularization Summary

• L2 weight decay: forces low weights, helps convergence (must do)
• L1 weight decay: forces sparsity (optional)
• Data augmentation: increases dataset size (large gains, must do)
• Noisy outputs: reduce the effect of bad examples (optional)
• Multi-task training: effective, but hard
• Early stopping: prevents overfitting (must do)
Regularization Summary cont’d
• Ensembles: free 2%, but expensive
• Dropout: cheap ensemble (optional)
• Batch normalization: better gradient flow (must do)