Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
By Anirban Santara*, Debapriya Maji, DP Tejas, Pabitra Mitra and Arobinda Gupta
Department of Computer Science and Engineering
Paper id: 6, PDCKDD Workshop
Transcript
Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
Department of Computer Science and Engineering, 7 September 2015
Page #2
Introduction
Autoencoder:
• An artificial neural network used for unsupervised learning.
• Consists of an encoder followed by a symmetrical decoder (a code sketch follows this slide).
• Learns to reconstruct the input with a minimum amount of deformation at the output of the decoder.
Deep Stacked Autoencoder:
• An autoencoder with 3 or more hidden layers of neurons.
• Learns representations of hierarchically increasing levels of abstraction from the data.
Uses:
• Efficient non-linear dimensionality reduction, e.g. Hinton 2006.
• Data-driven representation learning, e.g. Vincent 2009.
Fig: encoder-decoder structure of an autoencoder
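As an illustration (not the authors' code), an encoder followed by a symmetric decoder could be sketched in PyTorch as below; the layer sizes 784-1000-500-250-30 and the sigmoid activation are taken from the experimental set-up later in the deck.

```python
import torch.nn as nn

# Illustrative sketch, not the authors' implementation.
# Encoder 784 -> 1000 -> 500 -> 250 -> 30; decoder mirrors it back to 784.
dims = [784, 1000, 500, 250, 30]

def mlp(sizes):
    layers = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers)

encoder = mlp(dims)          # maps an input image to a 30-dimensional code
decoder = mlp(dims[::-1])    # symmetric decoder reconstructs the image
autoencoder = nn.Sequential(encoder, decoder)
# Training minimizes reconstruction error, e.g. nn.MSELoss()(autoencoder(x), x)
```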
Page #3
Training of deep stacked autoencoders
Pre-training: greedy layer-wise, unsupervised learning of each layer, using RBM, for example (Bengio 2009, Hinton 2006)
Fine-tuning: back-propagation over the entire network (Hinton 1989); a schematic code sketch follows this slide
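The two stages can be sketched as below. This is an illustration only: the slides pre-train each layer as an RBM with contrastive divergence, which is replaced here by a one-layer autoencoder trained on reconstruction error, and `pretrain_greedy` and its arguments are hypothetical names.

```python
import torch
import torch.nn as nn

def pretrain_greedy(dims, data, n_epochs=20, lr=0.1):
    """Greedy layer-wise pre-training: each layer runs all of its epochs
    before the next layer starts training on its outputs."""
    inputs, encoders = data, []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(n_epochs):                      # layer i blocks layer i+1
            loss = nn.functional.mse_loss(dec(enc(inputs)), inputs)
            opt.zero_grad(); loss.backward(); opt.step()
        inputs = enc(inputs).detach()                  # becomes the next layer's data
        encoders.append(enc)
    return encoders
# Fine-tuning then runs back-propagation through the full stacked network.
```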
Page #4
Efforts at parallelization
Data-level parallelism:
• Calculations pertaining to different subsets of the data are carried out at different processing nodes and the updates generated are averaged (see the sketch after this slide).
• Suitable for computing clusters as it requires little communication.
Network-level parallelism:
• The neural network is partitioned (physically or logically) and each part trains in parallel at a different computing node on the same whole dataset.
• Suitable for multi-core CPUs that allow fast inter-processor communication.
To the best of our knowledge, all existing methods of pre-training use a greedy layer-by-layer approach.
Fig: schematic of data-level partitioning vs. network-level partitioning
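As a toy contrast (not from the paper), data-level parallelism boils down to averaging the updates computed on separate shards; `grad_fn` below is a placeholder for whatever gradient the learning rule produces on one shard.

```python
import numpy as np

def data_parallel_step(params, data_shards, grad_fn, lr=0.1):
    # Each shard would be processed on a different node; here we just loop.
    updates = [grad_fn(params, shard) for shard in data_shards]
    return params - lr * np.mean(updates, axis=0)   # average, then apply once
```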
Page #5
A major drawback of greedy layer-wise pre-training
Fig: greedy pipeline in which layer Li is pre-trained on data Di for Ni epochs (D1 ... D4, N1 ... N4 epochs)
Every layer Li waits idle for:
• all layers L1 through Li-1, before it can start learning
• all the remaining layers, after it has finished learning (see the expression below)
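In symbols (with $t_j$ denoting the per-epoch training time of layer $L_j$, a notation not on the slide), the idle time of layer $L_i$ under greedy pre-training is roughly

$$\text{idle}(L_i) \approx \sum_{j<i} N_j t_j + \sum_{j>i} N_j t_j,$$

so each layer is actively learning for only $N_i t_i$ out of the total pre-training time $\sum_j N_j t_j$.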
The guiding philosophy of the proposed algorithm is to reduce the idle time of greedy layer-wise pre-training by introducing parallelism with synchronization
Page #6
Proposed algorithm: synchronized layer-wise pre-training
• A separate thread Ti pre-trains each layer Li on its current input data Di; it executes a specified Ni epochs of learning and goes to sleep
• If Ti-1 modifies Di after that, Ti wakes up, executes one epoch of learning and goes back to sleep
• The algorithm terminates when all the threads have finished their stipulated iterations (a thread-based sketch follows)
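The wake/sleep protocol could be sketched with Python threads and events as below. This is an illustration of the scheme described above, not the authors' implementation; `train_one_epoch`, `get_input`, `notify_next` and `all_done` are hypothetical hooks standing in for the per-layer RBM update, the layer's current input Di, the signal to the next thread, and the global termination flag.

```python
import threading

class LayerThread(threading.Thread):
    """Thread Ti pre-trains layer Li: Ni stipulated epochs, then sleep;
    wake and run one extra epoch whenever Ti-1 changes Di, until every
    thread has finished its stipulated epochs."""

    def __init__(self, layer, n_epochs, get_input, notify_next, all_done):
        super().__init__()
        self.layer = layer                      # e.g. the RBM for layer Li (hypothetical object)
        self.n_epochs = n_epochs                # stipulated Ni epochs
        self.get_input = get_input              # returns the current Di
        self.notify_next = notify_next          # tells Ti+1 that Di+1 has changed
        self.all_done = all_done                # threading.Event set at global termination
        self.input_changed = threading.Event()  # set by Ti-1 when it modifies Di

    def run(self):
        for _ in range(self.n_epochs):          # stipulated learning phase
            self.layer.train_one_epoch(self.get_input())
            self.notify_next()
        while not self.all_done.is_set():       # sleep, waking only on new input
            if self.input_changed.wait(timeout=0.1):
                self.input_changed.clear()
                self.layer.train_one_epoch(self.get_input())
                self.notify_next()
```

Because the threads exchange only small wake-up signals through shared memory, the scheme fits the multi-core, fast-communication setting identified for network-level parallelism on the earlier slide.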
Page #8
Experimental set-up
• Problem: dimensionality reduction of handwritten digits from the MNIST dataset with a deep stacked autoencoder, using mean squared error to measure reconstruction accuracy.
• Architecture:
  Depth: 5
  Layer dimensions: 784, 1000, 500, 250, 30
  Activation function: sigmoid
• Experiments:
  1. Benchmark with greedy layer-wise pre-training: 20 epochs of greedy layer-wise pre-training of each layer using RBM, then 10 epochs of fine-tuning with backpropagation over the entire architecture.
  2. Verification of the proposed synchronized layer-wise pre-training algorithm: a minimum of 20 epochs (Ni = 20) of synchronized layer-wise pre-training of each layer using RBM, then 10 epochs of fine-tuning with backpropagation over the entire architecture.
Fig: sample digits from MNIST
Page #9
Experimental set-up (contd.)
• Parameters for the learning algorithms (collected in the configuration sketch below):
  RBM (Contrastive Divergence): learning rate 0.1; momentum 0.5 for the first 5 epochs and 0.9 afterwards
  Backpropagation: learning rate 0.001
• System specifications:
  Number of CPU cores: 8
  Main memory: 8 GB
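Gathered into a configuration sketch (the names are illustrative, not taken from the authors' code):

```python
# Hyper-parameters reported on this slide, collected in one place.
RBM_CONTRASTIVE_DIVERGENCE = {
    "learning_rate": 0.1,
    # 0.5 for the first 5 epochs, 0.9 afterwards
    "momentum": lambda epoch: 0.5 if epoch < 5 else 0.9,
}
BACKPROPAGATION = {"learning_rate": 0.001}
SYSTEM = {"cpu_cores": 8, "main_memory_gb": 8}
```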
Page #10
Results (convergence)
Fig: reconstruction error for the training set and for the validation set
Page #11
Results: curious behaviour of innermost layer
Fig: reconstruction error for the training set and for the validation set
Page #12
Results: variation of overall reconstruction error on validation set
Fig: overall reconstruction error on the validation set during pre-training and during fine-tuning
Page #13
Comparison with greedy layer-wise pre-training
• Average squared reconstruction error per digit:
  Algorithm | Training error | Test error
  Greedy pre-training | 8.00 | 8.19
  Synchronized pre-training | 8.39 | 8.57
• Execution times: the proposed algorithm converges 1 h 26 min 49 s faster, which is a 26.17% speedup.
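A back-of-the-envelope check, assuming the 26.17% figure is measured relative to the greedy baseline's total training time (the absolute per-run times are not in the transcript):

```python
saved = 1 * 3600 + 26 * 60 + 49    # 5209 s saved by synchronized pre-training
greedy_total = saved / 0.2617      # about 19,900 s, roughly 5 h 32 min
sync_total = greedy_total - saved  # about 14,700 s, roughly 4 h 5 min
```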