
Perspectives of GPU computing in Physics and Astrophysics – 15 – 17 Sep 2014

Distributed simulation of Polychronous and plastic Spiking Neural Networks: experiments with GPUs

Francesco Simula (INFN) for the APE Lab:

FP7 FET PROJECT GRANT N. 247846 2010-2014

www.euretile.eu

Large scale modeling of neuro-synaptic activity and plasticity

■ DPSNN-STDP is a Distributed simulator of Polychronous Spiking Neural Nets with synaptic Spike-Timing-Dependent Plasticity: an efficient C++ plus MPI code developed by the APE lab of INFN to be used as a benchmark for the development of specialized HW/SW platforms (see arXiv:1310.8478, P. S. Paolucci et al.).

■ A brain simulation benchmark has 3 points of interest:

• As a source of requirements and architectural inspiration towards extreme parallelism

• As a parallel/distributed coding challenge

• As a scientific grand challenge


Where to start?


Paradigmatic neuron

Models of neural activity at spiking abstraction level


The Izhikevich model is:

■ computationally light: 13 FLOPs / ms-neuron (a physiologically accurate model needs ~1200 FLOPs / ms-neuron)

■ universal: same equations for all known types of cortical neurons

■ rich in dynamics: able to capture all 20 known neuron spiking patterns

Neural Spiking Model: the Izhikevich neuron


Summary of the neurocomputational properties of biological spiking neurons. The same model (Izhikevich – 2003) with different values of parameters a, b, c and d is able to reproduce the behaviour of several types of cortical neurons. Each horizontal bar corresponds to 20 ms.

Eugene M. Izhikevich – IEEE Trans. Neural Networks 15-5 (2004) pag. 1063-1070


The dynamical variables of the single neuron are v(t) and u(t):

■ v(t) is the neural membrane potential; this is the key observable! When v reaches vpeak, a neural spike is produced.

■ u(t) is an auxiliary variable (the recovery current bringing v back to equilibrium).

■ I(t) is the potential change generated by the sum of the currents from all synapses incoming to the neuron. It is a 'forcing function': incoming currents are present if spikes arrived from pre-synaptic neurons.

→ When a neuron spikes, all its M outgoing synapses add a current Wi to the neurons they are connected to, with a set of different delays ti (polychronicity).
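The equations themselves were images on the slide and did not survive extraction. As a reminder, the standard form published by Izhikevich (2003) is v' = 0.04 v^2 + 5 v + 140 - u + I, u' = a (b v - u), with the reset v ← c, u ← u + d after a spike. The sketch below integrates that form in Python; the "regular spiking" parameters (a = 0.02, b = 0.2, c = -65, d = 8), the 30 mV peak and the 0.5 ms Euler step are assumptions chosen for illustration and are not taken from the DPSNN-STDP code.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0,
                    v_peak=30.0, dt=0.5):
    """Advance one neuron by dt milliseconds; return (v, u, spiked)."""
    if v >= v_peak:
        # Spike: reset the membrane potential and bump the recovery variable.
        return c, u + d, True
    # Forward Euler on the two coupled equations (assumed integrator).
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
    du = a * (b * v - u)
    return v + dt * dv, u + dt * du, False


def count_spikes(current, duration_ms=1000.0, dt=0.5):
    """Count the spikes emitted under a constant input current."""
    v, u = -65.0, -13.0  # start close to rest, with u = b * v
    spikes = 0
    for _ in range(int(duration_ms / dt)):
        v, u, spiked = izhikevich_step(v, u, current, dt=dt)
        spikes += spiked
    return spikes
```

With zero input the neuron settles at rest and stays silent; with a constant input of 10 it fires tonically, which is the behaviour the "forcing function" description of I(t) suggests.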

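[Figure: a neuron spikes at t = t0; its M outgoing synapses (1, 2, ..., M) with weights W1, W2, ..., WM reach the target neurons (neurons in the drawing are labelled A, B, C, D) after different delays: I(t0 + t1) = ... + W1 + ..., I(t0 + t2) = ... + W2 + ..., I(t0 + tM) = ... + WM + ...]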
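One common way to implement this kind of delayed delivery is a circular buffer of future input currents, indexed by arrival time. The sketch below is a hypothetical Python illustration of that idea; the function names and the buffer layout are assumptions, not the data structures used inside DPSNN-STDP.

```python
def make_delay_buffer(n_neurons, max_delay):
    """One row of pending currents per future millisecond (plus the current one)."""
    return [[0.0] * n_neurons for _ in range(max_delay + 1)]


def deliver_spike(buffer, t, synapses):
    """Schedule the currents of a spike emitted at time t.

    synapses is a list of (target, weight, delay) tuples, with delay an
    integer number of milliseconds in the range 1..max_delay.
    """
    slots = len(buffer)
    for target, weight, delay in synapses:
        buffer[(t + delay) % slots][target] += weight


def pop_currents(buffer, t):
    """Return the currents arriving at time t and free that slot for reuse."""
    slot = t % len(buffer)
    currents = buffer[slot]
    buffer[slot] = [0.0] * len(currents)
    return currents
```

At each step the simulator would first call pop_currents for the current time, feed the result into the neural dynamics as I(t), and then call deliver_spike for every neuron that fired.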

Synaptic dynamics: Spike Timing-Dependent Plasticity (STDP)


STDP captures the timing-dependent causal/anti-causal relationship between pairs of neurons:

■ Causal potentiation: the synapse is maximally potentiated if its signal arrives at the target just before the post-synaptic spike.

■ Anti-causal depression: the weight is maximally depressed if the signal arrives just after the post-synaptic spike.

S. Song et al., Nature Neuroscience 3 (2000)

In DPSNN-STDP, all synaptic weight variations are accumulated over 1000 steps (timestep 1ms), then applied to the W’s (long term plasticity).
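A minimal Python sketch of this two-stage scheme follows. It assumes an exponential STDP window of the kind discussed by Song et al.; the amplitudes, the 20 ms time constant, the weight bounds and the function names are all hypothetical and are not taken from DPSNN-STDP.

```python
import math


def stdp_delta(dt_ms, a_plus=0.1, a_minus=0.12, tau_ms=20.0):
    """Weight change for dt_ms = t_post_spike - t_presynaptic_arrival."""
    if dt_ms > 0:
        # Causal: the input arrived before the post-synaptic spike.
        return a_plus * math.exp(-dt_ms / tau_ms)
    if dt_ms < 0:
        # Anti-causal: the input arrived after the post-synaptic spike.
        return -a_minus * math.exp(dt_ms / tau_ms)
    return 0.0


def apply_long_term_plasticity(weights, accumulated, w_min=0.0, w_max=10.0):
    """Fold the accumulated variations into the weights, clamp, then reset."""
    for i in range(len(weights)):
        weights[i] = min(w_max, max(w_min, weights[i] + accumulated[i]))
        accumulated[i] = 0.0
```

During the 1000 steps each spike pairing would add stdp_delta(...) to the matching entry of accumulated; apply_long_term_plasticity would then be called once per simulated second.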

Distribution of Cortical Fields and Cortical Modules among Software Processes


Spiking Activity and Synaptic Plasticity (from 100K to 6.6 Giga synapses, from 1 to 128 software processes)

■ The picture represents the evolution of a neural network computed by the DPSNN-STDP code

■ In the picture:

• 200 inhibitory neurons

• 800 excitatory neurons

• 100 000 synapses in total

• Time resolution: 1 ms (horizontal axis)

• Each dot in the rastergram represents an individual spike

■ The evolution of the membrane potential of each neuron is simulated

■ The evolution of each individual synaptic strength is computed (not shown in the picture)

■ Polychronism: individual synaptic delays are taken into account

■ Individual connections and neural types can be programmed


Emergent Biological Behaviour: Spontaneous Evolution of Rhythmic Activity due to Polychronism and Synaptic Plasticity

■ As synaptic weights evolve according to STDP (synaptic spike-timing dependent plasticity), the initial delta frequency oscillations (2-4 Hz in the first second of activity) dissolve for a while into uncorrelated Poissonian activity (at 100 s), and then gamma frequency activity emerges (30-100 Hz at 3600 s).

[Figure panels: delta rhythm @ first second; uncorrelated activity @ 100s; gamma rhythm after 3600s]


DPSNN-STDP: MPI version - Strong and Weak Scaling

■ Strong scaling: from 1 to 128 cores @ 2.4 GHz, simulating various total network sizes (from 51 Msyn to 6.6 Gsyn). Exec times are normalized to synapse count.

■ Weak scaling for various local network sizes. Exec times are normalized to synapse count.
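The plotted values did not survive extraction, so the Python sketch below only illustrates how such normalized figures are usually derived. The helper names are hypothetical, and any numbers passed to them in practice would come from measured runs, not from this slide.

```python
def time_per_synapse(exec_time_s, n_synapses):
    """Execution time normalized to the number of simulated synapses."""
    return exec_time_s / n_synapses


def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Fixed total network size: the ideal time drops as 1/p."""
    return (t_ref * p_ref) / (t * p)


def weak_scaling_efficiency(t_ref, t):
    """Fixed network size per process: the ideal time stays constant."""
    return t_ref / t
```

An efficiency of 1.0 means ideal scaling; in the strong-scaling case, for instance, a run that takes 100 s on 1 core and 25 s on 8 cores would score 0.5.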


From Program Flow and Profiling …


Function of the block                                     Relative execution time   Note
Long term potentiation + after spike dynamic              (9.7±0.7)%                Gather + computation
Barrier (optional)                                        (29.9±6.1)%               Workload fluctuations
Communication: inter-process multicast: Spikes dim        (0.77±0.10)%              Message passing
Communication: inter-process multicast: Spikes payload    (0.82±0.20)%              Message passing
Axonal to synaptic spikes: intra-process multicast        (16.8±2.3)%               Dereferencing
Add synaptic currents + long term depression              (19.2±2.7)%               Computation
Thalamic input                                            0.01%                     Simplified model
Ordinary neural dynamics                                  (11.8±1.4)%               Computation
Rastergram & other statistical functions                  (1.9±0.1)%                Computation
Long term synaptic plasticity                             (9.2±1.8)%                Computation

These 2 functions have:

• regular memory access patterns

• significant amounts of FP computing

... how do they behave on the GPU?
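Before looking at the GPU timings, it helps to bound what offloading only two blocks can buy. The highlighting that marked the two functions is lost in the extracted text; the Python sketch below assumes they are "Ordinary neural dynamics" (11.8%) and "Long term synaptic plasticity" (9.2%), the two kernels examined in the following slides, and applies Amdahl's law while ignoring host-device transfers.

```python
def amdahl_speedup(offloaded_fraction, kernel_speedup):
    """Whole-application speedup when only a fraction of the run time is accelerated."""
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / kernel_speedup)


# Assumed pair of offloaded blocks, with shares read from the profiling table.
OFFLOADED_FRACTION = 0.118 + 0.092
```

With a kernel speedup of 10 the whole application gains about 1.23x, with 20 about 1.25x, and even an infinitely fast kernel cannot exceed about 1.27x as long as the rest of the code stays on the CPU.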

GPU Environment

■ Trials were performed on:

Intel Xeon CPUs:

• E5620 2.40GHz (Westmere)

• E5-2630 v2 2.60GHz (Ivy Bridge)

NVIDIA GPUs:

• S2050 (Fermi-class/sm_2.0, PCIe Gen2 int.)

• K20Xm (Kepler-class/sm_3.5, PCIe Gen2 int.)

• K40m (Kepler-class/sm_3.5, PCIe Gen3 int.)

• CUDA 6.5

Using the CUDA Thrust template library, version 1.8 on GitHub (with support for CUDA streams):

• Convert CPU arrays to Thrust device_vectors (with caveats!)

• Convert CPU functions to Thrust functors... and you are done!


Long Term Plasticity on GPUs

Example:

■ Connectivity: M (synapses per neuron) = 100

■ Using 131072 neurons → 13107200 synapses (24 B/syn) → 300 MB

■ 9 bytes in/8 bytes out, 3 FLOPs + 2 MAX/MIN instr. (kernel is bandwidth bound):


Platform      Configuration   Time
E5620         1core           63.48ms
E5-2630v2     1core           50.28ms
S2050         AoS             16.42ms
K20Xm         AoS             7.47ms
K40m          AoS             6.99ms
S2050         SoA             1.64ms
K20Xm         SoA             1.01ms
K40m          SoA             0.98ms

Multi-core CPU timings, in the order they appear on the slide: 8core 17.5ms, 8core 13.5ms, 8core x2 11.2ms, 8core x2 12.8ms
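The "bandwidth bound" remark can be checked with a little arithmetic: each synapse moves 9 + 8 = 17 bytes, so a timing translates directly into an effective memory bandwidth. The Python sketch below does that conversion; the helper name is hypothetical, and decimal gigabytes (10^9 bytes) are assumed.

```python
N_SYNAPSES = 131072 * 100      # neurons x synapses per neuron, from the slide
TRAFFIC_PER_SYNAPSE = 9 + 8    # bytes read + bytes written per synapse


def effective_bandwidth_gb_s(time_ms):
    """Memory traffic of one kernel pass divided by its execution time."""
    total_bytes = N_SYNAPSES * TRAFFIC_PER_SYNAPSE
    return total_bytes / (time_ms * 1e-3) / 1e9
```

The fastest GPU timing on the slide (0.98 ms) corresponds to about 227 GB/s, while the slowest one (16.42 ms) corresponds to about 13.6 GB/s; the gap between the two layouts is therefore a gap in achieved bandwidth, not in arithmetic throughput.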

Long Term Plasticity on GPUs


Figures are VERY interesting, but there are caveats:

■ Best performance is obtained only when the code is refactored to keep ALL DATA on the GPU and to use SoA instead of AoS

■ If the GPU is used only as an accelerator for this kernel, the data must be transferred to/from the GPU: even when the data sample is segmented to overlap transfers and computing, you cannot go below the duration of a 300 MB cudaMemcpy:

• 51.6ms for the S2050

• 50.0ms for the K20Xm

• 29.9ms for the K40m
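The transfer bound can be made concrete with the numbers already on these slides. The Python sketch below derives the effective copy bandwidth and a lower bound on the offloaded kernel time; the helper names are hypothetical, and perfect overlap of copy and compute is an optimistic assumption.

```python
DATA_BYTES = 13107200 * 24     # 13107200 synapses at 24 bytes each


def copy_bandwidth_gb_s(copy_ms):
    """Effective host-device bandwidth implied by a measured copy time."""
    return DATA_BYTES / (copy_ms * 1e-3) / 1e9


def offload_time_ms(copy_ms, kernel_ms, overlapped=True):
    """Best case with perfect overlap, or plain sum when run back to back."""
    return max(copy_ms, kernel_ms) if overlapped else copy_ms + kernel_ms
```

The three copy times correspond to roughly 6.1, 6.3 and 10.5 GB/s. Even in the best case (29.9 ms copy fully overlapped with the 0.98 ms kernel) the offloaded step takes 29.9 ms, which is slower than the best multi-core CPU timing quoted earlier (11.2 ms).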

Neural dynamics on GPUs

Example: using 131072 neurons (16 B/neu) → 2 MB

■ 32 B in/8 B out, 18 FLOPs + 1 MIN instr. (kernel is still bandwidth bound):

Platform      Configuration   Time
E5-2630v2     8core x2        ~290µs
S2050         AoS             87µs
K20Xm         AoS             ~62µs
K40m          AoS             ~60µs
S2050         SoA             45µs
K20Xm         SoA             ~30µs
K40m          SoA             ~28µs

... but the timings seem a little volatile: is the sample too small?

■ Using a 10x sample → 1310720 neurons (16 B/neu) → 20 MB

Platform      Configuration   Time
S2050         AoS             720µs
K20Xm         AoS             340µs
K40m          AoS             330µs
S2050         SoA             290µs
K20Xm         SoA             180µs
K40m          SoA             170µs

... clearly better!

→ Anyway, the same caveats as for synaptic dynamics apply!
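The AoS/SoA distinction behind these columns is purely a matter of memory layout. The Python sketch below shows the conversion on a two-field neuron record; the choice of fields (v, u) and the function names are assumptions, and Python is used only to illustrate the layout, not the CUDA/Thrust code that was measured.

```python
from array import array


def aos_to_soa(neurons):
    """Array of Structures -> Structure of Arrays.

    neurons is a list of (v, u) records; the result is one contiguous
    array per field, so that element i of each array belongs to neuron i.
    """
    v = array('d', (record[0] for record in neurons))
    u = array('d', (record[1] for record in neurons))
    return v, u


def soa_to_aos(v, u):
    """Inverse conversion, back to one record per neuron."""
    return list(zip(v, u))
```

On a GPU the SoA form lets consecutive threads read consecutive addresses of the same field, which is the access pattern that memory coalescing rewards; with AoS the same reads are strided by the record size.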

What can we say up to now?

■ The GPU can be 10 to 20 times faster than a well distributed multicore CPU code on the arithmetic kernels

■ Best performance would require moving all code and data onto the GPU; what about the rest of the application? Elsewhere in the application there is a very large number of random accesses to memory (list scans, sorts, deeply nested dereferencing) where the GPU is known to suffer: it is hard to say beforehand whether there is a net gain in the end...

■ When moving to multi-node, what happens with GPU-to-GPU communications? → see the APEnet+ poster!


Conclusions

■ Key areas of present INFN activity on large scale neural modeling

Coding of a scalable parallel/distributed simulator

• INFN developed the DPSNN-STDP simulator in the EURETILE FET Project. Proven simulation up to 6.6G synapses, 128 cores.

• See arXiv:1310.8478 (Apr 2014)

Comparison with experimental neuro-biological data and calibration of the INFN simulator

• Will be performed in the CORTICONIC FET project (starting Oct 2014, ending Dec 2015) (cooperation with ISS, TUM, IDIBAPS)

In progress:

• Experiments on GPU porting

• Interface with experimental systems

• HW/SW co-design for brain simulations

• Inclusion of the simulator into robotic platforms

…open to partnership for future European Projects!

