Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
By Anirban Santara*, Debapriya Maji, DP Tejas, Pabitra Mitra and Arobinda Gupta
Department of Computer Science and Engineering
Paper id: 6, PDCKDD Workshop
Transcript
Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
Department of Computer Science and Engineering, 7 September 2015
Page #2
Introduction
Autoencoder:
• An artificial neural network used for unsupervised learning.
• Consists of an encoder followed by a symmetrical decoder (a code sketch follows this slide).
• Learns to reconstruct the input with a minimum amount of deformation at the output of the decoder.
Deep Stacked Autoencoder:
• An autoencoder with 3 or more hidden layers of neurons.
• Learns representations of hierarchically increasing levels of abstraction from the data.
Uses:
• Efficient non-linear dimensionality reduction, e.g. Hinton 2006.
• Data-driven representation learning, e.g. Vincent 2009.
Fig: encoder-decoder structure of an autoencoder
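As an illustration (not the authors' code), an encoder followed by a symmetric decoder could be sketched in PyTorch as below; the layer sizes 784-1000-500-250-30 and the sigmoid activation are taken from the experimental set-up later in the deck.

```python
import torch.nn as nn

# Illustrative sketch, not the authors' implementation.
# Encoder 784 -> 1000 -> 500 -> 250 -> 30; decoder mirrors it back to 784.
dims = [784, 1000, 500, 250, 30]

def mlp(sizes):
    layers = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers)

encoder = mlp(dims)          # maps an input image to a 30-dimensional code
decoder = mlp(dims[::-1])    # symmetric decoder reconstructs the image
autoencoder = nn.Sequential(encoder, decoder)
# Training minimizes reconstruction error, e.g. nn.MSELoss()(autoencoder(x), x)
```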
Page #3
Training of deep stacked autoencoders
Pre-training: greedy layer-wise, unsupervised learning of each layer, using RBM, for example (Bengio 2009, Hinton 2006)
Fine-tuning: back-propagation over the entire network (Hinton 1989); a schematic code sketch follows this slide
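The two stages can be sketched as below. This is an illustration only: the slides pre-train each layer as an RBM with contrastive divergence, which is replaced here by a one-layer autoencoder trained on reconstruction error, and `pretrain_greedy` and its arguments are hypothetical names.

```python
import torch
import torch.nn as nn

def pretrain_greedy(dims, data, n_epochs=20, lr=0.1):
    """Greedy layer-wise pre-training: each layer runs all of its epochs
    before the next layer starts training on its outputs."""
    inputs, encoders = data, []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(n_epochs):                      # layer i blocks layer i+1
            loss = nn.functional.mse_loss(dec(enc(inputs)), inputs)
            opt.zero_grad(); loss.backward(); opt.step()
        inputs = enc(inputs).detach()                  # becomes the next layer's data
        encoders.append(enc)
    return encoders
# Fine-tuning then runs back-propagation through the full stacked network.
```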
Page #4
Efforts at parallelization
Data-level parallelism:
• Calculations pertaining to different subsets of the data are carried out at different processing nodes and the updates generated are averaged (see the sketch after this slide).
• Suitable for computing clusters as it requires little communication.
Network-level parallelism:
• The neural network is partitioned (physically or logically) and each part trains in parallel at a different computing node on the same whole dataset.
• Suitable for multi-core CPUs that allow fast inter-processor communication.
To the best of our knowledge, all existing methods of pre-training use a greedy layer-by-layer approach.
Fig: schematic of data-level partitioning vs. network-level partitioning
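As a toy contrast (not from the paper), data-level parallelism boils down to averaging the updates computed on separate shards; `grad_fn` below is a placeholder for whatever gradient the learning rule produces on one shard.

```python
import numpy as np

def data_parallel_step(params, data_shards, grad_fn, lr=0.1):
    # Each shard would be processed on a different node; here we just loop.
    updates = [grad_fn(params, shard) for shard in data_shards]
    return params - lr * np.mean(updates, axis=0)   # average, then apply once
```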
Page #5
A major drawback of greedy layer-wise pre-training
Fig: greedy pipeline in which layer Li is pre-trained on data Di for Ni epochs (D1 ... D4, N1 ... N4 epochs)
Every layer Li waits idle for:
• all layers L1 through Li-1, before it can start learning
• all the remaining layers, after it has finished learning (see the expression below)
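In symbols (with $t_j$ denoting the per-epoch training time of layer $L_j$, a notation not on the slide), the idle time of layer $L_i$ under greedy pre-training is roughly

$$\text{idle}(L_i) \approx \sum_{j<i} N_j t_j + \sum_{j>i} N_j t_j,$$

so each layer is actively learning for only $N_i t_i$ out of the total pre-training time $\sum_j N_j t_j$.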
The guiding philosophy of the proposed algorithm is to reduce the idle time of greedy layer-wise pre-training by introducing parallelism with synchronization
Page #6
Proposed algorithm: synchronized layer-wise pre-training
• A separate thread Ti pre-trains each layer Li on its current input data Di; it executes a specified Ni epochs of learning and goes to sleep
• If Ti-1 modifies Di after that, Ti wakes up, executes one epoch of learning and goes back to sleep
• The algorithm terminates when all the threads have finished their stipulated iterations (a thread-based sketch follows)
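The wake/sleep protocol could be sketched with Python threads and events as below. This is an illustration of the scheme described above, not the authors' implementation; `train_one_epoch`, `get_input`, `notify_next` and `all_done` are hypothetical hooks standing in for the per-layer RBM update, the layer's current input Di, the signal to the next thread, and the global termination flag.

```python
import threading

class LayerThread(threading.Thread):
    """Thread Ti pre-trains layer Li: Ni stipulated epochs, then sleep;
    wake and run one extra epoch whenever Ti-1 changes Di, until every
    thread has finished its stipulated epochs."""

    def __init__(self, layer, n_epochs, get_input, notify_next, all_done):
        super().__init__()
        self.layer = layer                      # e.g. the RBM for layer Li (hypothetical object)
        self.n_epochs = n_epochs                # stipulated Ni epochs
        self.get_input = get_input              # returns the current Di
        self.notify_next = notify_next          # tells Ti+1 that Di+1 has changed
        self.all_done = all_done                # threading.Event set at global termination
        self.input_changed = threading.Event()  # set by Ti-1 when it modifies Di

    def run(self):
        for _ in range(self.n_epochs):          # stipulated learning phase
            self.layer.train_one_epoch(self.get_input())
            self.notify_next()
        while not self.all_done.is_set():       # sleep, waking only on new input
            if self.input_changed.wait(timeout=0.1):
                self.input_changed.clear()
                self.layer.train_one_epoch(self.get_input())
                self.notify_next()
```

Because the threads exchange only small wake-up signals through shared memory, the scheme fits the multi-core, fast-communication setting identified for network-level parallelism on the earlier slide.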
Page #8
Experimental set-up
• Problem: dimensionality reduction of handwritten digits from the MNIST dataset with a deep stacked autoencoder, using mean squared error to measure reconstruction accuracy.
• Architecture:
  Depth: 5
  Layer dimensions: 784, 1000, 500, 250, 30
  Activation function: sigmoid
• Experiments:
  1. Benchmark with greedy layer-wise pre-training: 20 epochs of greedy layer-wise pre-training of each layer using RBM, then 10 epochs of fine-tuning with backpropagation over the entire architecture.
  2. Verification of the proposed synchronized layer-wise pre-training algorithm: a minimum of 20 epochs (Ni = 20) of synchronized layer-wise pre-training of each layer using RBM, then 10 epochs of fine-tuning with backpropagation over the entire architecture.
Fig: sample digits from MNIST
Page #9
Experimental set-up (contd.)
• Parameters for the learning algorithms (collected in the configuration sketch below):
  RBM (Contrastive Divergence): learning rate 0.1; momentum 0.5 for the first 5 epochs and 0.9 afterwards
  Backpropagation: learning rate 0.001
• System specifications:
  Number of CPU cores: 8
  Main memory: 8 GB
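Gathered into a configuration sketch (the names are illustrative, not taken from the authors' code):

```python
# Hyper-parameters reported on this slide, collected in one place.
RBM_CONTRASTIVE_DIVERGENCE = {
    "learning_rate": 0.1,
    # 0.5 for the first 5 epochs, 0.9 afterwards
    "momentum": lambda epoch: 0.5 if epoch < 5 else 0.9,
}
BACKPROPAGATION = {"learning_rate": 0.001}
SYSTEM = {"cpu_cores": 8, "main_memory_gb": 8}
```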
Page #10
Results (convergence)
Fig: reconstruction error for the training set and for the validation set
Page #11
Results: curious behaviour of innermost layer
Fig: reconstruction error for the training set and for the validation set
Page #12
Results: variation of overall reconstruction error on validation set
Fig: overall reconstruction error on the validation set during pre-training and during fine-tuning
Page #13
Comparison with greedy layer-wise pre-training
• Average squared reconstruction error per digit:
  Algorithm | Training error | Test error
  Greedy pre-training | 8.00 | 8.19
  Synchronized pre-training | 8.39 | 8.57
• Execution times: the proposed algorithm converges 1 h 26 min 49 s faster, which is a 26.17% speedup.
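A back-of-the-envelope check, assuming the 26.17% figure is measured relative to the greedy baseline's total training time (the absolute per-run times are not in the transcript):

```python
saved = 1 * 3600 + 26 * 60 + 49    # 5209 s saved by synchronized pre-training
greedy_total = saved / 0.2617      # about 19,900 s, roughly 5 h 32 min
sync_total = greedy_total - saved  # about 14,700 s, roughly 4 h 5 min
```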