
Data Mining - Neural Networks

Dr. Jean-Michel RICHER

jean-michel.richer@univ-angers.fr
2018


Outline

1. Introduction

2. History and working principle

3. Improvements of NN

4. How to learn with a NN?

5. Backpropagation example

6. Interesting links and applications


1. Introduction


What we will cover

- basics of Artificial Neural Networks
- the perceptron
- the multi-layer network
- the sigmoid function
- backpropagation
- Synaptic.js

2. History and working principle

ANN

Artificial Neural Networks
- NNs, ANNs or Connectionist Systems are computing systems inspired by the biological neural networks that constitute animal brains
- based on a collection of connected units or nodes called artificial neurons
- they try to model how neurons in the brain function
- such systems learn or progressively improve their performance by considering examples (training phase)

Note: strong and weak AI, intelligence = calculation?

ANN

Specific Artificial Neural Networks
- for image recognition: Convolutional Neural Networks (CNN or ConvNet), a variation of multilayer perceptrons designed to require minimal preprocessing
- for speech recognition: Time Delay Neural Networks (TDNN)

What can you do with a NN?

A first example: MNIST
- the MNIST database of handwritten digits of 28 × 28 pixels
- 784 inputs and 10 outputs
- a database of 60,000 training examples and a test set of 10,000
- smallest error rate of 0.35% with a 6-layer NN (Ciresan et al., 2010)
- smallest error rate of 0.23% with a Convolutional Network (Ciresan et al., 2012)

ANN

McCulloch and Pitts, 1943
- Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician, explained the complex decision processes in a brain using a linear threshold gate
- takes a sum and returns 0 if the result is below the threshold and 1 otherwise
- very simple: binary inputs and outputs, threshold step activation function, no weighting of inputs

ANN

Donald O. Hebb, 1949
- the Hebbian rule is the basis of nearly all neural learning procedures
- the connection between two neurons is strengthened when both neurons are active at the same time
- this change in strength is proportional to the product of the two activities
- uses weights

ANN

Rosenblatt, 1958
- Frank Rosenblatt, a psychologist at Cornell, was working on understanding the comparatively simpler decision systems present in the eye of a fly, which underlie and determine its flee response
- he proposed the idea of the Perceptron (Mark I Perceptron)
- an algorithm for pattern recognition
- a simple input/output relationship, modeled on a McCulloch-Pitts neuron
- perceptron learning: weights are adjusted only when a pattern is misclassified

ANN

Bernard Widrow, Marcian E. Hoff, 1960
- Professor Widrow and his student Hoff introduced the ADALINE (ADAptive LInear NEuron)
- a fast and precise adaptive learning system: the least mean squares (LMS) filter
- delta rule: minimises the output error using (approximate) gradient descent
- found in nearly every analog telephone for real-time adaptive echo filtering

Note: Hoff received his master's degree from Stanford University in 1959 and his PhD in 1962; father of the microprocessor at Intel.

ANN

Minsky and Papert, 1969
- Marvin Minsky and Seymour Papert led a campaign to discredit neural network research
- all neural networks suffer from the same fatal flaw as the perceptron (XOR)
- they left the impression that neural network research was a dead end

Note: Minsky (MIT) is known for co-founding the field of AI; Papert (MIT) developed the Logo programming language.

ANN

Paul Werbos, 1974
- developed the back-propagation learning method in 1974, although its importance wasn't fully appreciated until 1986
- accelerates the training of multi-layer networks
- the input vector is applied to the network and propagated forward from the input layer to the hidden layer, and then to the output layer
- an error value is then calculated using the desired output and the actual output of each output neuron in the network
- the error value is propagated backward through the weights of the network, beginning with the output neurons, through the hidden layer and on to the input layer

ANN

Geoffrey Hinton, David Rumelhart, Ronald Williams, 1986
- Backpropagation: repeatedly adjust the weights so as to minimize the difference between the actual output and the desired output
- Hidden layers: neuron nodes stacked in between inputs and outputs, allowing the NN to learn more complicated features (such as XOR logic)

Multi Layer NN

Figure: from the course of Nahua Kang on towardsdatascience.com

ANN

Deep Learning
- Deep Learning is about constructing machine learning models that learn a hierarchical representation of the data
- Neural Networks are a class of machine learning algorithms
- example: the NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks

ANN working principle

The Artificial Neuron
- connected to n input channels x_1 to x_n
- each input has a synaptic weight w_1 to w_n
- there is a bias b
- uses an activation function f_a

The output is defined as:

y = f_a(∑_{i=1}^{n} x_i × w_i + b)

ANN principle

Neuron formula
The formula can be modified by incorporating the bias into the x_i × w_i terms:
- set x_0 = 1 and w_0 = b

The formula becomes:

y = f_a(∑_{i=0}^{n} x_i × w_i)
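To make the formula concrete, here is a minimal Python sketch of a single artificial neuron with the bias folded into w_0; the names heaviside and neuron_output are illustrative, not taken from the course:

import numpy as np

def heaviside(z):
    # threshold activation: 1 if z > 0, else 0
    return 1 if z > 0 else 0

def neuron_output(x, w, activation=heaviside):
    # compute f_a(sum_i x_i * w_i) with x_0 = 1 prepended so that w_0 plays the role of the bias
    x = np.concatenate(([1.0], x))
    return activation(np.dot(w, x))

# example: a neuron computing the boolean AND with hand-picked weights
w = np.array([-0.3, 0.2, 0.2])             # w_0 = bias, then w_1, w_2
print(neuron_output(np.array([1, 1]), w))  # -> 1
print(neuron_output(np.array([1, 0]), w))  # -> 0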


The perceptron

Figure: the perceptron, with inputs x_0 = 1, x_1, ..., x_n, weights w_0, ..., w_n, a summation unit ∑ and an activation f_a producing the output y.

Neuron

Figure: from the course of Nahua Kang on towardsdatascience.com

The perceptron

Neuron activation
The Heaviside (threshold or binary) function is of the form:

y = 1 if ∑_{i=0}^{n} w_i × x_i > 0, and 0 otherwise

The perceptron is a simple model of prediction.

The perceptron

Learn with the perceptron

Algorithm 1: Perceptron learning scheme
    initialize w;
    while not convergence do
        compute errors;
        update w from errors;
    end

The update of weight w_j for a training example x is:

w_j = w_j + η × (y − ŷ) × x_j

where y is the expected output, ŷ the computed output, and η is the learning constant (not too big, not too small), typically between 0.05 and 0.15.

The perceptron

AND / OR
The perceptron can implement boolean formulas like the boolean OR or the AND:

a  b  a ∨ b  a ∧ b
0  0    0      0
0  1    1      0
1  0    1      0
1  1    1      1

The perceptron

Example with AND

X =
  x0  x1  x2
   1   0   0
   1   0   1
   1   1   0
   1   1   1

y = (0, 0, 0, 1)

η = 0.1, w = [0.1, 0.2, 0.05]

The perceptron

Example with AND - first case
Take X_0 = [1, 0, 0] with expected output y = 0:

∑ w_i × x_i = 0.1 × 1 + 0.2 × 0 + 0.05 × 0 = 0.1
ŷ = f_a(0.1) = 1 ≠ y

so the weights are updated:

w_0 = w_0 + η × (y − ŷ) × x_0 = 0.1 + 0.1 × (0 − 1) × 1 = 0
w_1 = w_1 + η × (y − ŷ) × x_1 = 0.2 + 0.1 × (0 − 1) × 0 = 0.2
w_2 = w_2 + η × (y − ŷ) × x_2 = 0.05 + 0.1 × (0 − 1) × 0 = 0.05

Continue with X_1 = [1, 0, 1], ...

The perceptron

Example with AND - convergence
After convergence, w = [−0.30000001, 0.22, 0.10500001] and the result is:

x0  x1  x2  yp
1.  0.  0.  0
1.  0.  1.  0
1.  1.  0.  0
1.  1.  1.  1

It works!

The perceptron

Exercise
- Try to implement the perceptron in Python, C++ or Java
- and test it for the boolean AND and OR
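One possible NumPy sketch of such a perceptron, using the Heaviside activation and the update rule w_j = w_j + η × (y − ŷ) × x_j seen above (the function name train_perceptron and the default values are my own choices):

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50, w=None):
    # X holds one example per row with x_0 = 1 in the first column, y holds the targets
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            y_hat = 1 if np.dot(w, xi) > 0 else 0   # Heaviside activation
            w = w + eta * (target - y_hat) * xi     # no change if the example is well classified
            errors += int(y_hat != target)
        if errors == 0:                             # convergence: nothing misclassified
            break
    return w

# boolean AND, starting from the weights of the example slide
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w = train_perceptron(X, y_and, eta=0.1, w=np.array([0.1, 0.2, 0.05]))
print(w, [1 if np.dot(w, xi) > 0 else 0 for xi in X])   # predictions match the AND column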


The perceptron

Why is XOR not possible with a perceptron? (1/2)
The one-layer perceptron acts as a linear separator:

Figure: the points (0,0), (0,1), (1,0), (1,1) with their labels for AND, OR and XOR; for AND and OR a single line separates the 0s from the 1s, while for XOR it cannot.

The perceptron

Why is XOR not possible with a perceptron? (2/2)

a  b  a XOR b  equation
0  0     0     w_0 + 0 × w_1 + 0 × w_2 ≤ 0   (1)
0  1     1     w_0 + 0 × w_1 + 1 × w_2 > 0   (2)
1  0     1     w_0 + 1 × w_1 + 0 × w_2 > 0   (3)
1  1     0     w_0 + 1 × w_1 + 1 × w_2 ≤ 0   (4)

Adding (1) and (4), and then (2) and (3):

(1) + (4):  2 w_0 + w_1 + w_2 ≤ 0
(2) + (3):  2 w_0 + w_1 + w_2 > 0

which is impossible!

3. Improvements of Neural Networks

Other activation functions

Heaviside problem
If the activation function is linear then the final output is still a linear combination of the input data.

Sigmoid
A sigmoid function is a real function (a special case of the logistic function) which is:
- bounded (min, max)
- differentiable
- has a characteristic "S"-shaped curve

s(x) = σ(x) = 1 / (1 + e^(−x)) = e^x / (1 + e^x)

The sigmoid function

Other sigmoid-like functions
- hyperbolic tangent: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
- arctangent function: arctan(x)
- error function: (2/√π) ∫_0^x e^(−t²) dt

See Wikipedia for a complete list.


Characteristic features of the sigmoid function

Properties of the sigmoid function
- output values range from 0 to 1
- the curve crosses 0.5 at x = 0
- simple derivative: s′(x) = s(x) × (1 − s(x))
- used for models where you have to predict the probability of an output

See math.stackexchange.com for a derivation of the derivative.
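As a quick sketch, the sigmoid and its derivative fit in a few lines of Python (the names sigmoid and sigmoid_deriv are assumptions, chosen to match the backpropagation pseudo-code used later in the course):

import numpy as np

def sigmoid(x):
    # s(x) = 1 / (1 + e^(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(s):
    # derivative expressed from the sigmoid value itself: s'(x) = s(x) * (1 - s(x))
    return s * (1.0 - s)

print(sigmoid(0.0))                  # 0.5: the curve crosses 0.5 at x = 0
print(sigmoid_deriv(sigmoid(0.0)))   # 0.25: the maximum slope of the sigmoid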


4. How to learn with a Neural Network?

The backpropagation

Notations
- we will use L to refer to a layer
- y_L represents the output of layer L
- x_{L−1} represents the input used for the computation of y_L
- w_L is the vector of weights of layer L

The output is then computed by:

y_L = σ(w_L × x_{L−1} + b_L)

where σ is the sigmoid activation function and b_L is the bias.
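With these notations, evaluating one layer is a one-liner in NumPy; here is a sketch (the helper name layer_forward is an assumption, and the weights are given as a matrix so the same formula covers a layer with several neurons):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(w_L, x_prev, b_L):
    # y_L = sigma(w_L x_{L-1} + b_L) for one layer
    z_L = np.dot(w_L, x_prev) + b_L
    return sigmoid(z_L)

# a layer with 2 neurons and 2 inputs (the numbers of the worked example further on)
w = np.array([[0.15, 0.20], [0.25, 0.30]])
x = np.array([0.05, 0.10])
print(layer_forward(w, x, 0.35))   # one output value per neuron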


The backpropagation

Figure: two consecutive layers; layer L−1 produces y_{L−1} = σ(z_{L−1}), which feeds layer L through the weights w_L and bias b_L to produce y_L = σ(z_L).

To simplify understanding we will write:

y_L = σ(z_L)   with   z_L = w_L × x_{L−1} + b_L

The backpropagation

Imagine you want to build a NN to implement the XOR function using a hidden layer:

Figure: input layer L0, hidden layer L1 (output y_1), output layer L2 (output y_2).

We propagate the input values to the output layer:

y_1 = σ(w_1 × x_0 + b_1)
y_2 = σ(w_2 × y_1 + b_2)

We can then compare y_2 to the expected value y_exp.

The backpropagation

Error function and gradient
If y_2 and y_exp (the expected value for the output) are different, we need to modify the w_i and b_i. For this we compute the error as:

E(y_2) = ½ (y_exp − y_2)²

which in fact expands to:

E(y_2) = ½ (y_exp − σ(w_2 × σ(w_1 × x_0 + b_1) + b_2))²

so the error depends on w_1, b_1, w_2 and b_2.

The backpropagation

Error function and gradient
We will use the gradient of E to determine the influence of the w_L's and the biases b_L's:

∇E = (∂E/∂w_L, ∂E/∂b_L)

- +∇E is the direction that increases the function
- −∇E is the direction that decreases the function

The backpropagation

How to compute ∂E/∂w_L?

Remember that:

z_L = w_L × y_{L−1} + b_L
y_L = σ(z_L)
E = ½ (y_exp − y_L)²

So the derivative of E with respect to w_L can be rewritten as:

∂E/∂w_L = (∂E/∂y_L) × (∂y_L/∂z_L) × (∂z_L/∂w_L)

The backpropagation

How to compute ∂E/∂w_L?

Remember that:

∂E/∂y_L = ½ × 2 × (y_exp − y_L) × (−1)
∂y_L/∂z_L = σ′(z_L)
∂z_L/∂w_L = y_{L−1}

So the derivative of E with respect to w_L is:

∂E/∂w_L = −(y_exp − y_L) × σ′(z_L) × y_{L−1}

The backpropagation

What about ∂E/∂b_L?

Following the same reasoning, and since ∂z_L/∂b_L = 1, we get:

∂E/∂b_L = −(y_exp − y_L) × σ′(z_L)
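In code, these two gradients are one line each once the forward values are known; a sketch for a single neuron per layer, with assumed names:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer_gradients(y_prev, w_L, b_L, y_exp):
    # return (dE/dw_L, dE/db_L) for the last layer, with E = 1/2 (y_exp - y_L)^2
    z_L = w_L * y_prev + b_L
    y_L = sigmoid(z_L)
    sigma_prime = y_L * (1.0 - y_L)          # sigma'(z_L) expressed from y_L
    dE_db = -(y_exp - y_L) * sigma_prime     # dE/db_L
    dE_dw = dE_db * y_prev                   # dE/dw_L = dE/db_L * y_{L-1}
    return dE_dw, dE_db

# example call with arbitrary values
print(output_layer_gradients(y_prev=0.59, w_L=0.4, b_L=0.6, y_exp=0.01))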


The backpropagation

Last step
- we have to sum the errors over all the input data
- then propagate the change to the previous layer using the gradient:

L2_delta = L2_error * sigmoid_deriv(L2)
L1_error = L2_delta.dot(w2.T)
L1_delta = L1_error * sigmoid_deriv(L1)
w2 += L1.T.dot(L2_delta) * eta
w1 += L0.T.dot(L1_delta) * eta
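Filled out into a complete script, this scheme trains a small XOR network; the sketch below is self-contained (the variable names L0, L1, L2, w1, w2, eta follow the pseudo-code above, while the architecture, learning rate, initialisation and number of iterations are my own choices, not the course's):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(s):
    return s * (1.0 - s)                # derivative expressed from the activation value

# XOR training set; the first column is the constant input x_0 = 1 (bias of the hidden layer)
L0 = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y  = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
w1 = rng.uniform(-1, 1, (3, 4))         # input layer (3 values) -> hidden layer (4 neurons)
w2 = rng.uniform(-1, 1, (4, 1))         # hidden layer (4 neurons) -> output layer (1 neuron)
eta = 0.5

for _ in range(20000):
    # forward pass
    L1 = sigmoid(L0.dot(w1))
    L2 = sigmoid(L1.dot(w2))
    # backward pass: the update scheme of the slide
    L2_error = y - L2
    L2_delta = L2_error * sigmoid_deriv(L2)
    L1_error = L2_delta.dot(w2.T)
    L1_delta = L1_error * sigmoid_deriv(L1)
    w2 += L1.T.dot(L2_delta) * eta
    w1 += L0.T.dot(L1_delta) * eta

print(np.round(L2, 2))                  # should end up close to [[0], [1], [1], [0]]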



How to design a NN?

Design of a Neural Network
1. collect data (which data structure?)
2. normalize data
3. define training sets (fold technique)
4. define a test set (or use one of the folds)
5. train the network using backpropagation
6. test the result

Follow the tutorial by Jason Brownlee, How to Implement the Backpropagation Algorithm From Scratch In Python, November 2016.
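For step 3, splitting the dataset into folds takes only a few lines of plain Python; the helper name cross_validation_split is an assumption in the spirit of the tutorial above, not code copied from it:

import random

def cross_validation_split(dataset, n_folds=5, seed=1):
    # randomly partition the rows into n_folds folds (leftover rows are dropped for simplicity)
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    fold_size = len(rows) // n_folds
    return [rows[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

# example: 10 dummy rows split into 5 folds of 2 rows each
data = [[i, i % 2] for i in range(10)]
for fold in cross_validation_split(data, n_folds=5):
    print(fold)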


Backpropagation example

Figure: the example network, with two inputs x^1_1 = 0.05 and x^2_1 = 0.10, a hidden layer computing y^1_2 = σ(z^1_2) and y^2_2 = σ(z^2_2) with bias b_2 = 0.35, an output layer computing y^1_3 = σ(z^1_3) and y^2_3 = σ(z^2_3) with bias b_3 = 0.60, weights 0.15, 0.20, 0.25, 0.30 (W_2) and 0.40, 0.45, 0.50, 0.55 (W_3), and expected output values 0.01 and 0.99.

Backpropagation example

We define W_2 and W_3 as matrices:

W_2 = [ 0.15  0.20 ] = [ w^{1,1}_2  w^{1,2}_2 ]
      [ 0.25  0.30 ]   [ w^{2,1}_2  w^{2,2}_2 ]

W_3 = [ 0.40  0.45 ] = [ w^{1,1}_3  w^{1,2}_3 ]
      [ 0.50  0.55 ]   [ w^{2,1}_3  w^{2,2}_3 ]

Backpropagation example

Propagate the values of x^1_1 and x^2_1 by computing z^1_2 and y^1_2:

z^1_2 = b_2 + w^{1,1}_2 × x^1_1 + w^{1,2}_2 × x^2_1
z^1_2 = 0.35 + 0.15 × 0.05 + 0.2 × 0.1 = 0.3775
y^1_2 = σ(z^1_2)
y^1_2 = 1/(1 + e^(−0.3775)) = 0.5932

Backpropagation example

Repeat the process for z^2_2 and y^2_2:

z^2_2 = b_2 + w^{2,1}_2 × x^1_1 + w^{2,2}_2 × x^2_1
z^2_2 = 0.35 + 0.25 × 0.05 + 0.3 × 0.1 = 0.3925
y^2_2 = σ(z^2_2)
y^2_2 = 1/(1 + e^(−0.3925)) = 0.5968

Backpropagation example

To simplify the computation we could write:

[ z^1_2 ]   [ w^{1,1}_2  w^{1,2}_2 ]   [ x^1_1 ]         [ 1 ]
[ z^2_2 ] = [ w^{2,1}_2  w^{2,2}_2 ] × [ x^2_1 ] + b_2 × [ 1 ]
                     W_2                  X_1             B_2

or

Z_2 = W_2 × X_1 + B_2

and then

Y_2 = σ(Z_2)
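In NumPy the matrix form is immediate; a quick check with the numbers of this example (the array names are mine):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

W2 = np.array([[0.15, 0.20], [0.25, 0.30]])
X1 = np.array([0.05, 0.10])
b2 = 0.35                      # broadcast over the two neurons, i.e. b_2 * [1, 1]

Z2 = W2.dot(X1) + b2           # [0.3775, 0.3925]
Y2 = sigmoid(Z2)               # approximately [0.5933, 0.5969]
print(Z2, Y2)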


Backpropagation example

Propagate the values of y^1_2 and y^2_2 by computing z^1_3 and y^1_3:

z^1_3 = b_3 + w^{1,1}_3 × y^1_2 + w^{1,2}_3 × y^2_2
z^1_3 = 0.6 + 0.4 × 0.5932 + 0.45 × 0.5968 = 1.1059
y^1_3 = σ(z^1_3)
y^1_3 = 1/(1 + e^(−1.1059)) = 0.7513

Backpropagation example

Do the same for z^2_3 and y^2_3:

z^2_3 = b_3 + w^{2,1}_3 × y^1_2 + w^{2,2}_3 × y^2_2
z^2_3 = 0.6 + 0.5 × 0.5932 + 0.55 × 0.5968 = 1.2249
y^2_3 = σ(z^2_3)
y^2_3 = 1/(1 + e^(−1.2249)) = 0.7729

Backpropagation example

Compute the error of the network, where y_exp is the vector of expected values:

E(y_3) = ½ ∑_{i=1}^{2} (y^i_exp − y^i_3)²
E(y_3) = ½ ((y^1_exp − y^1_3)² + (y^2_exp − y^2_3)²)
E(y_3) = ½ ((0.01 − 0.7513)² + (0.99 − 0.7729)²)
E(y_3) = ½ (0.5496 + 0.0471) = 0.2983

Backpropagation example

We need to compute the gradient of the error to update W_3. The dependency chain for w^{1,1}_3 is:

w^{1,1}_3 → z^1_3 → y^1_3 = σ(z^1_3) → E(y_3)

Backpropagation example

We apply the chain rule for w^{1,1}_3:

∂E(y_3)/∂w^{1,1}_3 = ∂E(y_3)/∂y^1_3 × ∂y^1_3/∂z^1_3 × ∂z^1_3/∂w^{1,1}_3

where

∂E(y_3)/∂y^1_3 = 2 × ½ × (y^1_exp − y^1_3) × (−1)
∂y^1_3/∂z^1_3 = σ′(z^1_3) = y^1_3 × (1 − y^1_3)
∂z^1_3/∂w^{1,1}_3 = y^1_2

Backpropagation example

∂E(y_3)/∂w^{1,1}_3 = −(y^1_exp − y^1_3) × y^1_3 × (1 − y^1_3) × y^1_2
                   = −(0.01 − 0.7513) × 0.7513 × (1 − 0.7513) × 0.5932
                   = 0.7413 × 0.1868 × 0.5932
                   = 0.0821

Backpropagation example

For w^{1,2}_3:

∂E(y_3)/∂w^{1,2}_3 = ∂E(y_3)/∂y^1_3 × ∂y^1_3/∂z^1_3 × ∂z^1_3/∂w^{1,2}_3

where

∂E(y_3)/∂y^1_3 = 2 × ½ × (y^1_exp − y^1_3) × (−1)
∂y^1_3/∂z^1_3 = σ′(z^1_3) = y^1_3 × (1 − y^1_3)
∂z^1_3/∂w^{1,2}_3 = y^2_2

Backpropagation example

∂E(y_3)/∂w^{1,2}_3 = −(y^1_exp − y^1_3) × y^1_3 × (1 − y^1_3) × y^2_2
                   = −(0.01 − 0.7513) × 0.7513 × (1 − 0.7513) × 0.5968
                   = 0.7413 × 0.1868 × 0.5968
                   = 0.0826

Backpropagation example

For w^{2,1}_3:

∂E(y_3)/∂w^{2,1}_3 = ∂E(y_3)/∂y^2_3 × ∂y^2_3/∂z^2_3 × ∂z^2_3/∂w^{2,1}_3

where

∂E(y_3)/∂y^2_3 = 2 × ½ × (y^2_exp − y^2_3) × (−1)
∂y^2_3/∂z^2_3 = σ′(z^2_3) = y^2_3 × (1 − y^2_3)
∂z^2_3/∂w^{2,1}_3 = y^1_2

Backpropagation example

∂E(y_3)/∂w^{2,1}_3 = −(y^2_exp − y^2_3) × y^2_3 × (1 − y^2_3) × y^1_2
                   = −(0.99 − 0.7729) × 0.7729 × (1 − 0.7729) × 0.5932
                   = −0.2171 × 0.1755 × 0.5932
                   = −0.0226

Backpropagation example

For w^{2,2}_3:

∂E(y_3)/∂w^{2,2}_3 = ∂E(y_3)/∂y^2_3 × ∂y^2_3/∂z^2_3 × ∂z^2_3/∂w^{2,2}_3

where

∂E(y_3)/∂y^2_3 = 2 × ½ × (y^2_exp − y^2_3) × (−1)
∂y^2_3/∂z^2_3 = σ′(z^2_3) = y^2_3 × (1 − y^2_3)
∂z^2_3/∂w^{2,2}_3 = y^2_2

Backpropagation example

∂E(y_3)/∂w^{2,2}_3 = −(y^2_exp − y^2_3) × y^2_3 × (1 − y^2_3) × y^2_2
                   = −(0.99 − 0.7729) × 0.7729 × (1 − 0.7729) × 0.5968
                   = −0.2171 × 0.1755 × 0.5968
                   = −0.0227

Backpropagation example

We can now update W_3:

W*_3 = W_3 − η × [  0.0821   0.0826 ]
                 [ −0.0226  −0.0227 ]

where η is the learning rate; we set it to 0.5 in this case:

W*_3 = [ 0.358916479717885  0.408666186076233 ]
       [ 0.511301270238737  0.561370121107989 ]
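The whole worked example can be checked with a few lines of NumPy; this sketch (the array names are mine) reproduces the forward pass, the error, the gradient with respect to W_3 and the updated W*_3, up to rounding of the intermediate values:

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x     = np.array([0.05, 0.10])
W2    = np.array([[0.15, 0.20], [0.25, 0.30]]); b2 = 0.35
W3    = np.array([[0.40, 0.45], [0.50, 0.55]]); b3 = 0.60
y_exp = np.array([0.01, 0.99])
eta   = 0.5

y2 = sigmoid(W2.dot(x) + b2)              # [0.5933, 0.5969]
y3 = sigmoid(W3.dot(y2) + b3)             # [0.7514, 0.7729]
E  = 0.5 * np.sum((y_exp - y3) ** 2)      # about 0.2983

delta3 = -(y_exp - y3) * y3 * (1 - y3)    # one term per output neuron
dE_dW3 = np.outer(delta3, y2)             # [[0.0822, 0.0827], [-0.0226, -0.0227]]
W3_new = W3 - eta * dE_dW3                # [[0.3589, 0.4087], [0.5113, 0.5614]]
print(E)
print(W3_new)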


Limits of NN

Limits of Neural Networks
- does the use of the gradient give the minimum?
- as for Maximum Parsimony: does the minimum represent the best network?
- how to choose the number and size of the hidden layers?

6. Interesting links and applications

Toolkits

There are many toolkits for NN available for many languages:
- GPU computing: cuDNN (NVidia)
- Theano (University of Montreal)
- Tensorflow (Google)
- Caffe (Berkeley AI Research)
- MXNet (Microsoft, Nvidia, Intel, ...)
- many more on Wikipedia

Synaptic.js

Synaptic.js
Synaptic.js defines itself as "the javascript architecture-free neural network library for node.js and the browser":
- you can easily define a NN
- train it efficiently
- integrate the code in a web page


Synaptic.js for XOR

Manually:

var manualTrainingSet = [
  { input: [0,0], output: [0] },
  { input: [0,1], output: [1] },
  { input: [1,0], output: [1] },
  { input: [1,1], output: [0] }
]

Generated:

generatedTrainingSet = [];
for (var i = 0; i < 4; ++i) {
  var op1 = Math.trunc(i / 2);
  var op2 = Math.trunc(i & 1);
  input = [op1, op2];
  generatedTrainingSet.push({ input,
    "output": [Math.trunc(op1 ^ op2)] });
}


Synaptic.js for XOR

Training of the neural network
Don't use myTrainer.trainXOR(), which automatically provides a XOR training set, but use train(..):

// var trainingSet = manualTrainingSet;
var trainingSet = generatedTrainingSet;
myTrainer.train(trainingSet);

Synaptic.js for IRIS

Neural Network for the IRIS dataset
Modify the example of the XOR network to create a network for the IRIS dataset:
- take the IRIS dataset from WEKA and convert it to JSON (a sketch is given below)
- load the JSON data into the web page using jQuery
- train the network and display the results
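For the first step, one possible Python sketch to convert WEKA's iris.arff to JSON; it assumes the standard ARFF layout (an @data section of comma-separated rows ending with the class label), and the file names and JSON field names are my own choices, to be adapted to what the web page expects:

import json

def arff_to_json(arff_path, json_path):
    # convert a simple numeric ARFF file (like iris.arff) into a JSON list of rows
    rows, in_data = [], False
    with open(arff_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('%'):
                continue                      # skip blank lines and comments
            if line.lower().startswith('@data'):
                in_data = True                # the data section starts here
                continue
            if in_data:
                *values, label = line.split(',')
                rows.append({"input": [float(v) for v in values], "class": label})
    with open(json_path, 'w') as f:
        json.dump(rows, f)

arff_to_json("iris.arff", "iris.json")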



Synaptic.js for IRIS

jQuery:

<script type="text/javascript"
  src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js">
</script>

<script type="text/javascript">
var trainingSet = [];
$(document).ready(function(){
  $.getJSON("iris.json", function(result){
    for (i in result) {
      ...
    }
    ...
  });
});
</script>


Synaptic.js for IRIS

Normalization
To get the best results you need to normalize the data, for example:

- feature scaling:  x′ = (x − x_min) / (x_max − x_min)

- standard score:  x′ = (x − µ) / σ

where µ and σ are respectively the mean and the standard deviation of the data.
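Both normalizations are one-liners in NumPy; a small sketch, applied column by column, which is what you typically want for the IRIS features (the function names are mine):

import numpy as np

def feature_scaling(X):
    # rescale each column of X to [0, 1]: (x - min) / (max - min)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standard_score(X):
    # center and reduce each column: (x - mean) / std
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.8]])
print(feature_scaling(X))
print(standard_score(X))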


6. End


University of Angers - Faculty of Sciences

UA - Angers
2 Boulevard Lavoisier
49045 Angers Cedex 01
Tel: (+33) (0)2-41-73-50-72
