Deep learning: what? how? why? How to win a Kaggle competition
Posted on 16-Apr-2017
Transcript
Deep Learning: what, how, why? On the use of deep learning for Kaggle competitions
In about 45 minutes: ZMUV
In about 30 minutes: Bigger & Deeper is better
In about 15 minutes
Who knows what Kaggle is?
Who has worked on Kaggle problems?
Who would if he had the time?
Who works in machine learning?
Who knows convolutional neural networks?
Who has worked with deep learning before?
Who knows what dropout is?
Who am I? Jonas Degrave, PhD student at UGent
Who are we? The former Reservoir Lab, now the Data Science Lab, IDLab
Research in neural networks since 2005
HIPSTER ALERT
What do we do? Machine learning, robotics, brain-inspired computing
What did we do?
Totalling $160k in prizes
Testimonials
Neural networks in 5 minutes
Input layer → Hidden layers → Output layer
Gradient descent
Backpropagation
Deep learning
Input layer → Hidden layer → Hidden layer → Hidden layer → Output layer
History
Artificial neural net: 1949
Backpropagation: 1975
Deep learning: 2012
What used to be the problem
Input layer → Hidden layer → Hidden layer → Hidden layer → Output layer
Vanishing gradients, and all information is gone
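As a quick illustration (not on the slides), a minimal numpy sketch of why the gradient vanishes: backpropagation multiplies layer-local derivatives, and the sigmoid's derivative is at most 0.25, so the signal shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Backprop through 10 sigmoid layers at the point x = 0, where the
# derivative s * (1 - s) takes its maximum value of 0.25.
grad = 1.0
for _ in range(10):
    s = sigmoid(0.0)
    grad *= s * (1 - s)
print(grad)  # 0.25**10 ~ 9.5e-7: nearly all gradient information is gone
```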
For a long time, we didn't know
GPUs, rectifiers, maxpool, dropout
What has changed? They fight vanishing gradients!
Deep learning
State of the art for all problems with spatially correlated data
No more feature engineering!
Old-school bingo: Boltzmann machines, energy, tanh or sigmoid activation, feature engineering, deep belief networks
How to do modern neural networks
Make the sets, make them well: train set, validation set & test set
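Not on the slides, but a minimal numpy sketch of that step: shuffle once, then split; the 80/10/10 fractions are a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 20))  # hypothetical dataset

idx = rng.permutation(len(X))            # shuffle once, up front
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train = X[idx[:n_train]]
val = X[idx[n_train:n_train + n_val]]    # for model selection & ensembling
test = X[idx[n_train + n_val:]]          # touch only for the final estimate
print(len(train), len(val), len(test))   # 800 100 100
```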
Choose your error function
Always optimize the error function where possible!
Use the error function that fits your problem
Make train & validation curves
(plot: error over training time, for the training and validation sets)
Underfitting & overfitting
(plot: training and validation error over time, marking the underfitting and overfitting regimes)
Regularize. Bigger & Deeper.
"Larger networks tend to work better. Make your network bigger and bigger until the accuracy stops increasing. Then regularize the hell out of it. Then make it bigger still." (Yoshua Bengio)
My first architecture
Start with standard components: conv layers, dense layers, max-pooling, dropout
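For illustration, a minimal sketch of such a starting point; PyTorch and the exact layer sizes are my choices here, not the talk's.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # conv layer
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max-pooling
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # conv layer
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                              # dropout
    nn.Linear(64 * 8 * 8, 10),                    # dense layer
)

x = torch.randn(1, 3, 32, 32)  # a hypothetical CIFAR-sized input
print(model(x).shape)          # torch.Size([1, 10])
```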
Sparsity: make sure that, for each sample, only a few parameters are used
Dropout
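A minimal numpy sketch of (inverted) dropout, the variant most libraries implement: at train time each activation is zeroed with probability p and the survivors are rescaled, so the expected activation is unchanged; at test time nothing happens.

```python
import numpy as np

def dropout(activations, p=0.5, train=True, rng=None):
    if not train:
        return activations
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p  # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)      # rescale so the expectation is unchanged

h = np.ones((2, 4))
print(dropout(h, p=0.5))  # roughly half the units zeroed, survivors scaled by 2
```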
Maxpool
Rectifier (aka ReLU)
Convolution layers
No bigger than 3x3:
3x3 layer: 9 parameters, 3x3 receptive field
5x5 layer: 25 parameters, 5x5 receptive field
2 stacked 3x3 layers: 18 parameters, 5x5 receptive field
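A quick check of those counts (weights only, single channel, no biases), sketched in PyTorch:

```python
import torch.nn as nn

one_5x5 = nn.Conv2d(1, 1, kernel_size=5, bias=False)
two_3x3 = nn.Sequential(  # same 5x5 receptive field, fewer parameters
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
print(sum(p.numel() for p in one_5x5.parameters()))  # 25
print(sum(p.numel() for p in two_3x3.parameters()))  # 18
```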
Output function:
Softmax: logloss / cross entropy
Sigmoid: cross entropy
Identity: regression
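A minimal numpy sketch of the first pairing, a softmax output with the cross-entropy (logloss) error:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index])    # logloss of the true class

logits = np.array([2.0, 0.5, -1.0])        # hypothetical network outputs
probs = softmax(logits)
print(probs, cross_entropy(probs, target_index=0))
```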
My first architecture
~ 1 million parameters
Let us optimize
Gradient Descent
(diagram: the gradient is computed over the whole train set)
Stochastic Gradient Descent
(diagram: the gradient is computed on a batch drawn from the train set)
Adam update rule
(diagram: successive per-batch gradients are combined into a single weight update step)
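A minimal numpy sketch of a single Adam step (Kingma & Ba, 2014), with the usual default hyperparameters:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # the weight update step
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
batch_gradients = [np.array([0.1, -0.2, 0.3])] * 3  # hypothetical per-batch gradients
for t, grad in enumerate(batch_gradients, start=1):
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```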
Local minimum
You want generalization, not the global minimum on the train set!
My first architecture
~ 1 million parameters
Initialization
Weight matrices: random orthogonal initialization, with the correct amplitude
Most libraries provide this
Does not lose information
Output layer: you have prior information! Initialize with zeros!
Bias
Bias sets the initial sparsity!
Think about your initialization!
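Pulling the advice together, a minimal numpy sketch (sizes are hypothetical): orthogonal hidden weights via a QR decomposition, a zero-initialized output layer, and a bias chosen to set how many rectifiers start active.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal(shape, gain=1.0):
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix the column signs
    return gain * q

W_hidden = orthogonal((256, 256))
print(np.allclose(W_hidden.T @ W_hidden, np.eye(256)))  # True: no information lost

W_output = np.zeros((256, 10))  # output layer: you have prior information, start at zero
b_hidden = np.full(256, 0.1)    # a small positive bias: most rectifiers start active
```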
My first architecture
~ 1 million parameters
Does it work? Train on 1 sample, then on 2 samples: the network should be able to overfit them perfectly.
Learning rate
(plot: too high a learning rate overshoots; too low a rate learns too slowly)
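A minimal sketch of the trade-off, using plain gradient descent on f(w) = w^2 with three hypothetical rates:

```python
def gd(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w  # the gradient of w**2 is 2 * w
    return w

print(gd(lr=0.01))  # too low: after 20 steps still far from the minimum at 0
print(gd(lr=0.5))   # well chosen: lands on the minimum immediately
print(gd(lr=1.1))   # too high: overshoots, |w| grows every step
```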
Data preprocessing: ZMUV your data
ZMUV = Zero Mean, Unit Variance
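A minimal numpy sketch: fit the mean and standard deviation on the train set only, then apply the same transform to the validation and test sets.

```python
import numpy as np

def fit_zmuv(X_train, eps=1e-8):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps  # eps guards against constant features
    return mean, std

X_train = np.random.default_rng(0).normal(5.0, 3.0, size=(1000, 3))
mean, std = fit_zmuv(X_train)
X_zmuv = (X_train - mean) / std      # reuse mean & std for val and test data
print(X_zmuv.mean(axis=0).round(6))  # ~0 for every feature
print(X_zmuv.std(axis=0).round(6))   # ~1 for every feature
```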
Batch normalization
(diagram: the network from before, with a batchnorm layer inserted before each hidden layer and the output layer)
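For illustration, a minimal numpy sketch of what each batchnorm layer does at train time: per-feature ZMUV over the batch, followed by a learned scale and shift.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature batch statistics
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # ZMUV over the batch
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.default_rng(0).normal(2.0, 4.0, size=(32, 8))  # a hypothetical batch
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))   # ~0 and ~1
```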
Regularization
Data augmentation
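A minimal numpy sketch of two common image augmentations, random horizontal flips and random crops, applied on the fly during training; the image and crop sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=24):
    if rng.random() < 0.5:
        image = image[:, ::-1]                        # random horizontal flip
    top = rng.integers(0, image.shape[0] - crop + 1)  # random crop position
    left = rng.integers(0, image.shape[1] - crop + 1)
    return image[top:top + crop, left:left + crop]

image = rng.random((32, 32))
print(augment(image).shape)  # (24, 24)
```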
Unsupervised learning: learn on the test set, pseudo-labeling, ladder networks
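Of these, pseudo-labeling is the easiest to sketch: predict on the unlabeled test data and keep only the confident predictions as extra training labels. A minimal numpy sketch with hypothetical model outputs:

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    confident = probs.max(axis=1) >= threshold  # keep only confident predictions
    return confident, probs.argmax(axis=1)

probs = np.array([[0.98, 0.02],   # hypothetical predicted class probabilities
                  [0.60, 0.40],
                  [0.01, 0.99]])
keep, labels = pseudo_label(probs)
print(keep, labels)  # [ True False  True] [0 0 1]: add rows 0 and 2 to the train set
```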
Insert a priori information into the architecture
It's an art
Regularize. Bigger & Deeper.
"Rinse and repeat." (Jonas Degrave)
Ensemble
The average prediction will always be better than the worst prediction.
Ensemble: optimized on the validation set
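A minimal numpy sketch of that step, with two hypothetical models and a simple grid search over the blending weight on the validation set:

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.random(100)                    # hypothetical validation targets
pred_a = y_val + rng.normal(0, 0.10, 100)  # hypothetical predictions of model A
pred_b = y_val + rng.normal(0, 0.15, 100)  # hypothetical predictions of model B

best_w, best_err = 0.0, np.inf
for w in np.linspace(0, 1, 101):           # grid over the weight of model A
    err = np.mean((w * pred_a + (1 - w) * pred_b - y_val) ** 2)
    if err < best_err:
        best_w, best_err = w, err

print(best_w, best_err)  # reuse best_w to blend the test-set predictions
```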
Submit
Computing time: deadlines are fixed
End performance is proportional to the number of iterations, NOT the training time per model
Major take-aways:
Everything has a reason.
Don't buy into hypes.
If it can't be explained in 1 minute why it works, it probably isn't working.
Skip connections
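A minimal numpy sketch of a skip connection: the input is added to the branch output, so the gradient has a direct path backwards; with a zero-initialized branch the block starts out as the identity.

```python
import numpy as np

def residual_block(x, W):
    return x + np.maximum(0, x @ W)  # skip connection: add the input back

x = np.ones((1, 8))
W = np.zeros((8, 8))  # zero-initialized branch: the block starts as the identity
print(np.allclose(residual_block(x, W), x))  # True
```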
Wide convolutions
Thanks! Questions: @317070