XOR with intermediate (“hidden”) units Intermediate units can re-represent input patterns as new patterns with altered similarities Targets which are not linearly separable in the input space can be linearly separable in the intermediate representational space Intermediate units are called “hidden” because their activations are not determined directly by the training environment (inputs and targets) 1 / 33 hidden output Hidden-to-output weights can be trained with the Delta rule How can we train input-to-hidden weights? Hidden units do not have targets (for determining error) Trick: We don’t need to know the error, just the error derivatives 2 / 33 Delta rule as gradient descent in error (sigmoid units) n j = i a i w ij a j = 1 1+ exp (-n j ) Error E = 1 2 j (t j - a j ) 2 w ij a i → n j → t j a j → E Gradient descent: w ij = - ∂ E ∂ w ij ∂ E ∂ w ij = ∂ E ∂ a j da j dn j ∂ n j ∂ w ij = - (t j - a j ) a j (1 - a j ) a i w ij = - ∂ E ∂ w ij = (t j - a j ) a j (1 - a j ) a i 3 / 33 Generalized Delta rule (“back-propagation”) n j = i a i w ij a j = 1 1+ exp (-n j ) Error E = 1 2 j (t j - a j ) 2 hidden output n i → w ij a i → n j → t j a j → E Gradient descent: w ij = - ∂ E ∂ w ij ∂ E ∂ n j = ∂ E ∂ a j da j dn j = - (t j - a j ) a j (1 - a j ) ∂ E ∂ w ij = ∂ E ∂ n j ∂ n j ∂ w ij = ∂ E ∂ n j a i ∂ E ∂ a i = j ∂ E ∂ n j ∂ n j ∂ a i = j ∂ E ∂ n j w ij 4 / 33