ECE421 – TUT0102
Tutorial 1: Introduction to Tensorflow
First part: downloading and installing Anaconda, Tensorflow 1.x and Jupyter Notebook
Second part: basic concepts, Tensorflow 1.x models, Tensor graph, and eager execution mode
Third part: tf.placeholder and additional topics such as Tensorflow optimizers, Numpy and Tensorflow 2.x
Last part: code demo
1
Fun Fact About Tensorflow
President Obama catches a pregnant woman who
almost fainted during his speech on Healthcare.gov
in the Rose Garden (2013)
Meet Karmel Allison, manager for the TensorFlow team at Google Brain (2019)
This happens when
I talk too long
2
What is Tensorflow?
Open-source library for numerical computation in large-scale machine learning.
Uses Python for building models, executes models in C++ for performance. In other words, Python API with C++ runtime.
Operations can run on both CPU and GPU.
CPU: few cores/processors, cores are fast, good at sequential tasks
GPU: thousands of cores, cores are slow, good at parallel tasks (e.g., matrix multiplication!)
3
What is Jupyter Notebook?
Web-browser based interactive coding environment
Consists of three parts:
1. Web-browser for users to write code snippets
2. Notebook server for saving file with .ipynb extension on local disk
3. Kernel responsible for executing code snippets and returning results
General purpose, not just for writing Tensorflow code
Can be run in the cloud by using Google Colab, i.e., no local resources necessary.
4
Setting up Tensorflow and Jupyter (Windows)
(Follow the guide on Course Website for Linux or MacOS)
Steps:
1. Install Anaconda via Course Link
2. Open Anaconda Prompt and install Tensorflow
3. Open Anaconda Navigator and install Jupyter notebook
Note: Anaconda helps to install, modify and manage Python libraries
5
Setting up Tensorflow and Jupyter (Windows)
(Follow the guide on Course Website for Linux or MacOS)
Steps:
1. Install Anaconda via Course Link: https://repo.continuum.io/archive/index.html
Search for the file ‘Anaconda3-4.2.0-Windows-x86_64.exe’
Link in handout: “Tensorflow Installation Guide Using Anaconda”
Note: we are installing the CPU version only
GPU version requires carefully choosing CUDA Toolkit and cuDNN
Tensorflow Basics – Creating Tensors
tf.constant(): instantiates a constant tensor
x = tf.constant(1)
>> Tensor("Const:0", shape=(), dtype=int32)
tf.Variable(): instantiates a variable tensor whose value can be changed (this simply means the tensor can be assigned a different value). This is usually used for trainable variables such as weights and biases.
tf.placeholder(): instantiates a container whose value we will provide when running the program. This is usually used for our input data and training labels.
x = tf.placeholder(tf.int32)
>> Tensor("Placeholder:0", dtype=int32)
22
Tensorflow Basics – Creating Tensors
tf.fill(): similar to tf.constant, but fills a tensor of the given shape with a single scalar value
x = tf.fill([2,2], 1) #creates a 2x2 tensor of ones
tf.random.normal(): tensor of the specified shape whose values are drawn at random from a normal distribution with the specified mean and standard deviation.
x = tf.random.normal([2,2], 5.0, 10.0)
#creates a Gaussian distributed 2x2 tensor with mean 5 and standard deviation 10
This can be used to initialize weights associated with a neural network.
x = tf.Variable(tf.random.normal([2,2], 5.0, 10.0))
23
Tensorflow Basics – Datatypes (dtypes)
Get the datatype of a tensor from its dtype attribute (tensor.dtype)
tf.Variable(3.14159265359).dtype
>> tf.float32_ref
'Incompatible type conversion' errors are very common!
Typecasting using tf.cast(Tensor, dtype): convert from float32 to float64
y = tf.cast(tf.Variable(3.14159265359), tf.float64)
>> Tensor("Cast:0", shape=(), dtype=float64)
24
Tensorflow Basics – Shape, Size and Axis
tensor.get_shape(): method that returns the shape of a tensor
x = tf.constant([[1,2,3], [4,5,6]])
print(x.get_shape())
>> (2, 3)
Size: the total number of elements in a tensor (here 2 × 3 = 6).
Axes: indices corresponding to the entries of the shape of a tensor.
For the previous example, the shape is (2, 3):
axis 0 corresponds to the first entry, which indexes the rows
axis 1 corresponds to the second entry, which indexes the columns.
[[1, 2, 3],
 [4, 5, 6]]
25
Tensorflow Basics – Manipulating Tensors
tf.matmul: multiplies two tensors of appropriate dimensions
a = tf.constant([[1, 2, 3], [4, 5, 6]])
b = tf.constant([[7, 8], [9, 10], [11, 12]])
c = tf.matmul(a, b)
(a is the 2x3 matrix [[1, 2, 3], [4, 5, 6]], b is the 3x2 matrix [[7, 8], [9, 10], [11, 12]], so c is 2x2)
tf.square: element-wise square of all entries of a tensor
x = tf.constant([1, 2, 3, 4, 5])
y = tf.square(x)
(y has the entries 1², 2², 3², 4², 5²)
If we sum up the entries of y and take the square root, we obtain the Euclidean norm of x.
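The two operations on this slide can be checked without a Tensorflow session. The sketch below is a NumPy analogue (an assumption of this note, not code from the slides), where np.matmul stands in for tf.matmul and np.square for tf.square.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
b = np.array([[7, 8], [9, 10], [11, 12]])   # shape (3, 2)
c = np.matmul(a, b)                          # shape (2, 2)
print(c)

x = np.array([1, 2, 3, 4, 5])
y = np.square(x)                             # element-wise square
norm_x = np.sqrt(np.sum(y))                  # Euclidean norm of x
print(norm_x, np.linalg.norm(x))
```

Both printed norms should agree, since np.linalg.norm computes the same square root of the sum of squares.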
26
Tensorflow Basics – Manipulating Tensors
tf.reduce_sum: sum across a given axis (if no axis is provided, sum the entire tensor)
Broadcasting: set of rules to add/subtract/multiply … tensors of different shapes
x = tf.constant([1, 2, 3])
y = tf.constant(4)
z = x + y
>> [5, 6, 7]
Not very different from MATLAB…
x = tf.constant([1, 2, 3])
y = tf.constant([4,5])
z = x + y
>> Error! (shapes (3,) and (2,) cannot be broadcast together)
You can look up all the broadcasting rules on the Tensorflow website
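NumPy follows the same reduction and broadcasting conventions, so the behaviour above can be tried out directly. This is a small sketch using NumPy as a stand-in (an assumption of this note; the slides use Tensorflow tensors).

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])
total = m.sum()            # sum of all entries
col_sums = m.sum(axis=0)   # collapse the rows: one sum per column
row_sums = m.sum(axis=1)   # collapse the columns: one sum per row

x = np.array([1, 2, 3])
z = x + 4                  # the scalar is broadcast to every entry

try:
    x + np.array([4, 5])   # shapes (3,) and (2,) are incompatible
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(total, col_sums, row_sums, z, broadcast_failed)
```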
28
Building a Tensorflow Model – Part 1
Tensorflow (version 1.x) allows for two modes of computation
1. Computation (or dataflow) graph: builds a graph consisting of tensors (edges) and operations (nodes), then "runs the graph" to get the value. This is the default setting and the most common way to build a model (in version 1.x).
Why a computation graph?
A GPU or multi-core CPU can compute the values of different branches or subgraphs in parallel. This parallelizes training and lets models scale up.
The computation graph resides in CPU or GPU memory.
Computation graphs can be visualized using Tensorboard (more to come).
30
Tensorflow allows for two modes of computation
1. Computation graph
Aside, deep neural networks are usually
visualized as computation graphs as well in
many publications
This diagram is from a recent paper titled “A
Style-Based Generator Architecture for
Generative Adversarial Networks” by Karras,
Laine, Aila (Dec 12, 2018)
Building a Tensorflow Model – Part 1
31
Tensorflow allows for two modes of computation
1. Computation graph
Computation graph = Block diagrams
Tensors = Signals
Tensor Operations = Systems
Optimizers = Feedback control laws
Weights = State variables
Conclusion: most ML models are discrete-time nonlinear feedback control systems! Certainly all the ones in this course.
Next tutorial: perceptron algorithm
Building a Tensorflow Model – Part 1
32
Example 1: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph
import tensorflow as tf          # Import tensorflow module (must have!)
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2                # Build computation graph (we will show later)
We do not have the value of f at this stage!!!
All tensorflow does at this point is construct the computation graph.
It is not possible to get the value of any tensor without running the "graph".
How to run the graph?
f.eval()
Error: cannot evaluate tensor.
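The "build first, run later" idea can be illustrated without Tensorflow. The sketch below is a toy graph written for this note: the Node class and the as_node and run helpers are hypothetical and are not part of any Tensorflow API. Building f only creates nodes, and a separate call evaluates them.

```python
class Node:
    """A node in a toy computation graph (hypothetical helper, not Tensorflow)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op            # "const", "add" or "mul"
        self.inputs = inputs    # upstream nodes
        self.value = value      # only used by "const" nodes

    def __add__(self, other):
        return Node("add", (self, as_node(other)))

    def __mul__(self, other):
        return Node("mul", (self, as_node(other)))


def as_node(v):
    """Wrap plain Python numbers so they can be used as graph inputs."""
    return v if isinstance(v, Node) else Node("const", value=v)


def run(node):
    """Evaluate a node by recursively evaluating its inputs."""
    if node.op == "const":
        return node.value
    left, right = (run(n) for n in node.inputs)
    return left + right if node.op == "add" else left * right


x = as_node(3)
y = as_node(4)
f = x * x * y + y + 2     # builds nodes only; nothing is computed yet
result = run(f)           # "running the graph" produces the value
print(type(f).__name__, result)
```

Here f is a Node object rather than a number, which mirrors why the tensor f on the slide has no value until the graph is run.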
34
Example 1: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph
import tensorflow as tf                        # Import tensorflow module (must have!)
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2                              # Build computation graph
init_op = tf.global_variables_initializer()    # Create a node to initialize all variables (x, y) of the graph
with tf.Session() as sess:
    sess.run(init_op)
    print(sess.run(f))
>> 42
Run the graph within a "session", which is the execution environment. sess.run(init_op) must be run before any other node, as it actually sets x = 3, y = 4.
If sess.run(init_op) is not included: "Error: Attempting to use uninitialized value x"
35
Example 1: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph – alternative version
Version using a 'with' block:
import tensorflow as tf
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    print(sess.run(f))
Alternative version using an interactive session:
import tensorflow as tf
sess = tf.InteractiveSession()
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2
init_op = tf.global_variables_initializer()
sess.run(init_op)
print(sess.run(f))
sess.close()
This method expands the scope of the session beyond a single 'with' block. Just remember to close it.
36
Let’s visualize the computation graph using Tensorboard!
37
Example 1: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph
import tensorflow as tf
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f))
    writer.close()
Procedure to open Tensorboard:
1. Add two lines of code in the 'with' block (create the FileWriter, then close it)
2. Open Anaconda Prompt, and type:
tensorboard --logdir=path to directory --host=127.0.0.1
3. Type in localhost:6006 in a browser
Example of a path: "C:/Users/bolin/Desktop/Tutorial_1/Example" (absolutely no spaces in directory names)
38
A lot more going on under the hood…
Notice the difference between the tf.constant (2) and the tf.Variable (y).
A constant is just an "injection" into a node.
A variable has more going on because it can be reassigned during execution.
40
Example 2: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph, but using tf.constant
import tensorflow as tf          # Import tensorflow module (must have!)
x = tf.constant(3, name="x")
y = tf.constant(4, name="y")
f = x*x*y + y + 2                # Build computation graph
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f))
    writer.close()
>> 42
If only using tf.constant tensors, there is no need to initialize.
42
Example 2: Given $x = 3$, $y = 4$, find $f = x^2 y + y + 2$
Solve using computation graph, but using tf.constant
import tensorflow as tf
x = tf.constant(3, name="x")
y = tf.constant(4, name="y")
f = x*x*y + y + 2
with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
Optimization node automatically finds all the variable tensors that the loss function depends on, and updates them one step at a time during execution.
47
Example: minimize $f(x) = (\log(x))^2$
import tensorflow as tf
Set initial value to be 5
Define function
Define optimizer (with learning rate 0.1)
Define optimizer node
Run gradient descent for 100 steps
Note: range(100) = 0, 1, 2, … , 99
>> … step 100: x = 1.000216 log(x)^2: 4.6555176e-10
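The Tensorflow code for this slide did not survive in the transcript, so the following is only a plain-Python sketch of the same procedure. It assumes vanilla gradient descent with a hand-written derivative instead of a Tensorflow optimizer node, and the names grad_f and learning_rate are chosen for this note.

```python
import math

def f(x):
    return math.log(x) ** 2

def grad_f(x):
    # d/dx (log x)^2 = 2 log(x) / x
    return 2.0 * math.log(x) / x

x = 5.0                   # initial value
learning_rate = 0.1
for step in range(100):   # steps 0, 1, ..., 99
    x = x - learning_rate * grad_f(x)

print(x, f(x))
```

The iterate should approach the minimizer x = 1, where log(x) = 0. The exact digits may differ from the Tensorflow run on the slide, which uses a different numerical setup.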
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0.1,10,100)
f = np.square(np.log(x))
plt.plot(x, f, 'b')
plt.show()
Difference between tf.Variable and tf.placeholder
tf.Variable has a value upon construction and can be "mutated" (assigned different values). For us, these are the trainable weights.
tf.placeholder does not possess a value at the construction phase of the graph and cannot be mutated. For us, these are the training data and labels.
A placeholder only needs to know its type (although the shape can also be specified).
We "feed" data as a dictionary to the tf.placeholder at the same time as we evaluate the graph using sess.run (syntax: feed_dict={placeholder: value})
49
Example 4: Given $x = 3$, $y = 4$, find $f = x^2 y + y + c$
Solve using computation graph, where $c$ is a placeholder
import tensorflow as tf
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
c = tf.placeholder(tf.int32)                   # Create a placeholder
f = x*x*y + y + c
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    writer = tf.summary.FileWriter("path to directory", sess.graph)
    sess.run(init_op)
    print(sess.run(f, feed_dict={c: 2}))       # Get the value of tensor f by feeding in the value of c
    writer.close()
>> 42
50
Example 4: Given $x = 3$, $y = 4$, find $f = x^2 y + y + c + d$, $g = x y + d$, where $c$, $d$ are placeholders
import tensorflow as tf
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
c = tf.placeholder(tf.int32)
d = tf.placeholder(tf.int32)
f = x*x*y + y + c + d
g = x*y + d
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    print(sess.run([f, g], feed_dict={c: 2, d: 3}))
>> [45, 15]
Output multiple values by passing a list of tensors to sess.run; the results come back as a list in the same order.
52
A quick word about Numpy…
Numpy is the numerical computing library for Python
import numpy as np
Similar to MATLAB, gentle learning curve
It is used together with Tensorflow! (For instance, Tensorflow automatically converts numpy arrays in the feed dictionary into Tensor objects)
Popular applications
Loading data: np.load('./data.npy')
Initializing variables: np.zeros
Math operations: np.argmax, np.transpose
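A few of the calls listed above, shown on small hypothetical arrays. np.load is left out because it needs a data file on disk.

```python
import numpy as np

w = np.zeros((3, 2))                 # initialize a 3x2 array of zeros
scores = np.array([0.1, 0.7, 0.2])
best = np.argmax(scores)             # index of the largest entry
a = np.array([[1, 2, 3], [4, 5, 6]])
a_t = np.transpose(a)                # shape (3, 2)
print(w.shape, best, a_t.shape)
```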
About Tensorflow 2.0
Tensorflow is migrating from 1.x to 2.0
Here are some main changes:
1. Fully integrated with Keras, a high-level API for Tensorflow (much cleaner syntax)
2. Uses eager execution by default; the "with tf.Session()" workflow is removed
3. Tensorboard is fully integrated with Tensorflow (no more need to resolve localhost:6006)
4. Miscellaneous removal of duplicate functionalities (major problem in TF 1.x)
5. Syntactical consistency
Current status of Tensorflow 2.0 (Sept, 2019)
Tensorflow 2.0 Release Candidate (RC) version is available as a nightly build, but the full release is not available until around December of 2019.
All Tensorflow 1.x code will have to be migrated… many 1.x APIs are deprecated, and you have to write "tf.compat.v1" everywhere
Should I learn TF 1.x or 2.0? What do I do?
From Tensorflow engineers at Google:
Learn Keras.
Keras wraps around TF 1.x code and will be fully compatible with TF 2.0.
55
Topics that we left out…
tf.data (possible replacement for tf.placeholder)
tf.assign (another way to assign values into variables / alternative to initializers)
tf.expand_dims and tf.squeeze (commonly used shape manipulation)
Saving and resuming your session (very simple code: see other document)
Trainability of Variables (non-trainable variables cannot be modified by Optimizer)
Dynamic versus static shape (placeholders often do not have a dynamic shape)
Plotting results using Matplotlib and seaborn
Estimators and feature columns (another high-level tensorflow API)
Other software and tools: Colab, GCE, Pytorch, Pandas, Scikit, JAX, Sonnet, etc.
You will find out about them as we go along…
Recommended Programming References
Deep Learning with Python – François Chollet (author of Keras)
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow – Aurélien Géron
(electronic version already out, paperback out in two weeks!)
Python Data Science Handbook – Jake VanderPlas
An introduction to Computer Science using Python 3.6 – Paul Gries, Greg Wilson, Jason Montojo, Jennifer T. Campbell
57
Tutorial 2
Binary Linear Classification and the Perceptron Algorithm
Bolin Gao
Sept 19, 2019
Binary Linear Classification
In binary linear classification, we are given a data set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in X = \mathbb{R}^d$, $d \geq 1$, and $y_n \in Y = \{+1, -1\}$. For instance, suppose we have a data set of images of cats and dogs. Each example $x_n$ is a vectorized RGB image of a cat or a dog. We assume $y_n = +1$ if it is a cat, $y_n = -1$ if it is a dog.
Next, we seek a function that takes in $x_n \in X$ and outputs a prediction $y \in Y$. The function we select will come from the following hypothesis set,
\[ H = \left\{ h : X \to Y,\ x \mapsto y \;\middle|\; h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{d} w_i x_i + b\Big),\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\} \tag{1} \]
where sgn is the signum function,
\[ \mathrm{sgn}\Big(\sum_{i=1}^{d} w_i x_i + b\Big) = \begin{cases} +1 & \text{if } \sum_{i=1}^{d} w_i x_i > -b \\ -1 & \text{if } \sum_{i=1}^{d} w_i x_i < -b \end{cases} \]
Suppose we have selected an arbitrary hypothesis function $h \in H$. The next step is to evaluate how good such a function is at predicting the labels of the examples that we are given. Our assumption is that, if our hypothesis does well on the examples we are given, then hopefully it will do well on the examples we are not given. To do so, we introduce the 0-1 loss, $L_{0-1} : Y \times Y \to \{0, 1\}$,
\[ L_{0-1}(h(x), y) = \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases} \tag{2} \]
Remark 1. We note that the 0-1 loss is equivalently written (in your book) in the Iverson bracket notation, $L_{0-1}(h(x_n), y_n) = [\![ h(x_n) \neq y_n ]\!]$, so that the total loss over the data set is $\sum_{n=1}^{N} [\![ h(x_n) \neq y_n ]\!]$.
Then our in-sample error is the average of our loss over the data set, $E_{in} : H \to \mathbb{R}$,
\[ E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} L_{0-1}(h(x_n), y_n) \tag{3} \]
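As a concrete illustration of the in-sample error, the sketch below evaluates the average 0-1 loss of two hypotheses on a small data set. The data, the function names predict and in_sample_error, and the choice to map a score of exactly zero to -1 are all assumptions made for this example.

```python
import numpy as np

def predict(X, w, b):
    """h(x) = sgn(w^T x + b), applied to every row of X."""
    return np.where(X @ w + b > 0, 1, -1)

def in_sample_error(X, y, w, b):
    """Average 0-1 loss over the data set."""
    return np.mean(predict(X, w, b) != y)

# Hypothetical toy data: four points in R^2 with labels in {+1, -1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, 1.0]])
y = np.array([1, 1, -1, -1])

e_good = in_sample_error(X, y, w=np.array([1.0, 1.0]), b=0.0)
e_bad = in_sample_error(X, y, w=np.array([-1.0, 0.0]), b=0.0)
print(e_good, e_bad)
```

The first weight vector classifies every example correctly, while the second one gets every example wrong, so the two errors sit at the two extremes of the range [0, 1].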
Formally, binary linear classification involves the following problem.
Binary Linear Classification Problem
Given a data set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{+1, -1\}$.
Find a hypothesis $g \in H$ such that $g = \arg\min_{h \in H} E_{in}(h)$, where $E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} L_{0-1}(h(x_n), y_n)$.
Observe that the above problem is formulated in terms of minimizing over a function space. Since each hypothesis function is specified by its weight parameter $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, we can equivalently state the problem explicitly in terms of the weight vector as follows.
Binary Linear Classification Problem (Equivalent formulation)
Given a data set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{+1, -1\}$.
Find a weight vector $w^\star \in \mathbb{R}^d$ and bias $b^\star \in \mathbb{R}$ such that $(w^\star, b^\star) = \arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} E_{in}(w, b)$, where $E_{in}(w, b) = \frac{1}{N} \sum_{n=1}^{N} L_{0-1}\big(\mathrm{sgn}\big(\sum_{i=1}^{d} w_i x_{n,i} + b\big), y_n\big)$.
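We note that it is common for the bias to be grouped in with the weights. To this end, we can redefine our weight vector and our data vector as
\[ w = \begin{bmatrix} b & w_1 & w_2 & \dots & w_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{4} \]
and
\[ x = \begin{bmatrix} 1 & x_1 & x_2 & \dots & x_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{5} \]
so that $h(x) = \mathrm{sgn}(w^\top x)$. This is the notation we will use in the following section.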
Perceptron Learning algorithm
We claim that an optimal weight vector $w^\star$ can be found whenever the data set $D$ is linearly separable. Assuming that our data is linearly separable, the perceptron learning algorithm (PLA) is as follows:
Perceptron Learning Algorithm
Input: $D \neq \emptyset$, assumed to be linearly separable
Output: Optimal weight $w^\star$
while there exists $n \in \{1, \dots, N\}$ such that $y_n \neq \mathrm{sgn}(w_k^\top x_n)$ do
    Find any $n$ such that $y_n \neq \mathrm{sgn}(w_k^\top x_n)$
    $w_{k+1} = w_k + y_n x_n$
Remark 2. Observe that if our data is not linearly separable, then the loop never terminates and the weight will oscillate forever.
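The algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's reference implementation: the inputs are already augmented with a leading 1, a point with $y_n w^\top x_n \leq 0$ is treated as misclassified, the first misclassified index is always picked, and max_updates is a safety cap added here that is not part of the algorithm.

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Perceptron learning algorithm on augmented inputs (first column of X is 1).

    Returns the final weight vector and the number of updates performed.
    """
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        # A point is treated as misclassified when y_n * w^T x_n <= 0
        wrong = np.flatnonzero(y * (X @ w) <= 0)
        if wrong.size == 0:
            return w, updates
        n = wrong[0]              # any misclassified index works
        w = w + y[n] * X[n]
    return w, max_updates

# Hypothetical linearly separable toy data, with a leading 1 for the bias
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, n_updates = perceptron(X, y)
print(w, n_updates)
```

On data that is not linearly separable, the loop would only stop because of the cap, which matches the remark above.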
An intuitive argument for why $w_k$ converges to $w^\star$ is as follows. Suppose $(x_n, y_n)$ is an example that was misclassified at iteration $k$. Then $w_k^\top x_n$ has the opposite sign of $y_n$ (by definition of what misclassified means), so $y_n (w_k^\top x_n) < 0$. Now we need to show that the next weight improves upon the classification of this example, which means that $y_n (w_k^\top x_n) < y_n (w_{k+1}^\top x_n)$. This can be shown by a direct calculation:
\begin{align*}
y_n (w_{k+1}^\top x_n) &= y_n \big((w_k + y_n x_n)^\top x_n\big) \\
&= y_n (w_k^\top x_n + y_n x_n^\top x_n) \\
&= y_n (w_k^\top x_n + y_n \|x_n\|_2^2) \\
&= y_n w_k^\top x_n + y_n^2 \|x_n\|_2^2 \\
&= y_n w_k^\top x_n + \|x_n\|_2^2 \\
&> y_n w_k^\top x_n
\end{align*}
The second-to-last line uses $y_n^2 = 1$. The last inequality always holds because the first entry of $x_n$ is assumed to be 1, i.e., $x_{n,0} = 1$, and the norm $\|x\| = 0$ if and only if $x = 0$; hence $\|x_n\|_2^2 > 0$. This means that we are moving in the correct direction.
Theorem 1. Assuming that $w(0) = 0$, the perceptron learning algorithm converges in a finite number of iterations.
Remark 3. Following the convention in your textbook, we will use the notation $w(t)$ to denote the weight vector at iteration $t = 1, 2, \dots$. The proof roughly follows the one given by Block and Novikoff in 1962. The symbol $\|\cdot\|_2$ denotes the Euclidean norm, i.e., $\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \dots + v_d^2}$ for any vector $v \in \mathbb{R}^d$.
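Proof. We will divide the proof into the following parts:
1. Let $\rho = \min_{1 \leq n \leq N} y_n (w^{\star\top} x_n)$; show $\rho > 0$.
If $w^\star$ is optimal, then $y_n (w^{\star\top} x_n) > 0$ for all $n$. This implies that $\rho > 0$.
2. Show that $w(t)^\top w^\star \geq w(t-1)^\top w^\star + \rho$ and conclude $w(t)^\top w^\star \geq t\rho$.
Using the definition of the update, with $(x_n, y_n)$ the misclassified example used at that update,
\begin{align*}
w(t)^\top w^\star &= (w(t-1) + y_n x_n)^\top w^\star \\
&= w(t-1)^\top w^\star + y_n x_n^\top w^\star \\
&\geq w(t-1)^\top w^\star + \min_{1 \leq n \leq N} y_n x_n^\top w^\star \\
&= w(t-1)^\top w^\star + \rho
\end{align*}
Starting from $t = 1$ (recall $w(0) = 0$), we have,
\begin{align*}
w(1)^\top w^\star &\geq \rho \\
w(2)^\top w^\star &\geq w(1)^\top w^\star + \rho \geq 2\rho \\
&\ \ \vdots \\
w(t)^\top w^\star &\geq t\rho.
\end{align*}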
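3. Show that $\|w(t)\|_2^2 \leq \|w(t-1)\|_2^2 + \|x(t-1)\|_2^2$, where $(x(t-1), y(t-1))$ is a misclassified example at the previous iteration. This follows by expanding $\|w(t-1) + y(t-1)\,x(t-1)\|_2^2$ and noting that the cross term $2\,y(t-1)\,w(t-1)^\top x(t-1)$ is not positive for a misclassified example.
4. Applying step 3 repeatedly, starting from $w(0) = 0$, gives $\|w(t)\|_2^2 \leq tR^2$, where $R = \max_{1 \leq n \leq N} \|x_n\|_2$.
Recall that from (2), we have $w(t)^\top w^\star \geq t\rho$. And from (4), we have $\|w(t)\|_2^2 \leq tR^2$, which implies $\|w(t)\|_2 \leq \sqrt{t}R$. Hence,
\[ \frac{w(t)^\top w^\star}{\|w(t)\|_2} \geq \frac{t\rho}{\sqrt{t}R} = \frac{\sqrt{t}\rho}{R} \]
By the Cauchy-Schwarz inequality, $w(t)^\top w^\star \leq \|w(t)\|_2 \|w^\star\|_2$, we have,
\[ \frac{\|w(t)\|_2 \|w^\star\|_2}{\|w(t)\|_2} \geq \frac{\sqrt{t}\rho}{R} \implies t \leq \frac{R^2 \|w^\star\|_2^2}{\rho^2} \]
where $R = \max_{1 \leq n \leq N} \|x_n\|_2$, $\rho = \min_{1 \leq n \leq N} y_n (w^{\star\top} x_n)$ and $w^\star$ is the optimal weight.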
Remark 4. The interpretation of this result is as follows: the perceptron updates every time an incorrect label is found. Hence $t$ represents the total number of updates that the perceptron makes. Since the number of updates is bounded (at most $T = \frac{R^2 \|w^\star\|_2^2}{\rho^2}$), $w(t) \to w^\star$ in a finite number of iterations.
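The bound can be checked numerically on a small example. In the sketch below, the toy data and the helper names pla_updates and update_bound are assumptions of this note. Since the argument in the proof only needs a weight that separates the data with $\rho > 0$, the weight found by the perceptron itself is used as $w^\star$.

```python
import numpy as np

def pla_updates(X, y):
    """Run the perceptron from w(0) = 0 and count the updates."""
    w, t = np.zeros(X.shape[1]), 0
    while True:
        wrong = np.flatnonzero(y * (X @ w) <= 0)
        if wrong.size == 0:
            return w, t
        w, t = w + y[wrong[0]] * X[wrong[0]], t + 1

def update_bound(X, y, w_star):
    """Evaluate R^2 ||w*||^2 / rho^2 for a separating weight w_star."""
    R = np.max(np.linalg.norm(X, axis=1))
    rho = np.min(y * (X @ w_star))
    assert rho > 0, "w_star must classify every example correctly"
    return (R ** 2) * np.dot(w_star, w_star) / rho ** 2

# Hypothetical linearly separable toy data (leading 1 is the bias feature)
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, t = pla_updates(X, y)
bound = update_bound(X, y, w_star=w)
print(t, bound)
```

The number of updates t should not exceed the printed bound. Note that pla_updates has no iteration cap, so it should only be called on separable data.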
Observe also that we can write the bound as,
\[ t \leq \frac{R^2 \|w^\star\|_2^2}{\rho^2} = \frac{R^2}{\rho^2 / \|w^\star\|_2^2} = R^2 \left[ \frac{\rho^2}{\|w^\star\|_2^2} \right]^{-1} = R^2 \left[ \frac{\rho}{\|w^\star\|_2} \right]^{-2}. \]
Let $d_{\min} = \frac{\rho}{\|w^\star\|_2}$; then $d_{\min}$ is by definition the smallest distance between a point in the data set and the hyperplane $\{x \in \mathbb{R}^{d+1} \mid w^{\star\top} x = 0\}$, and the bound reads $t \leq R^2 / d_{\min}^2$. Therefore, if the distance $d_{\min}$ is small, it may take more iterations for the perceptron to get the weight right, although in practice the perceptron algorithm converges to the optimal weight very quickly.
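In linear regression, we are given a data set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in X = \mathbb{R}^d$, $d \geq 1$, and $y_n \in Y = \mathbb{R}$. Next, we seek a function that takes in $x_n \in X$ and outputs a prediction $y \in Y$. The function we select will come from the following hypothesis set,
\[ H = \left\{ h : X \to Y,\ x \mapsto y \;\middle|\; h(x) = \sum_{i=1}^{d} w_i x_i + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R},\ x \in \mathbb{R}^d \right\} \tag{1} \]
or in compact form,
\[ H = \left\{ h : \{1\} \times X \to Y,\ x \mapsto y \;\middle|\; h(x) = \sum_{i=0}^{d} w_i x_i,\ w \in \mathbb{R}^{d+1},\ x \in \{1\} \times \mathbb{R}^d \right\} \tag{2} \]
The notation $\{1\} \times \mathbb{R}^d$ represents the set of vectors whose leading coefficient is 1, i.e., $\{v \in \mathbb{R}^{d+1} \mid v = (1, x_1, x_2, \dots, x_d)\}$.
We will use the convention,
\[ w = \begin{bmatrix} b & w_1 & w_2 & \dots & w_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{3} \]
\[ x = \begin{bmatrix} 1 & x_1 & x_2 & \dots & x_d \end{bmatrix}^\top \in \mathbb{R}^{d+1} \tag{4} \]
throughout the rest of this section.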
We evaluate our hypothesis function through the squared loss,
\[ L_{sq}(h(x), y) = (h(x) - y)^2, \quad y \in \mathbb{R} \tag{5} \]
Then the in-sample error is given by,
\[ E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} L_{sq}(h(x_n), y_n) = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n - y_n)^2. \tag{6} \]
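A more compact form of the in-sample error can be constructed as follows. Define the data matrix and the target vector as,
\[ X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times (d+1)} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N \tag{7} \]
where we assume that the number of data points is much larger than the number of features, i.e., $N \gg d + 1$. The data matrix can also be expressed as a set of column vectors,
\[ X = \begin{bmatrix} q_0 & q_1 & \dots & q_d \end{bmatrix} \tag{8} \]
where each $q_i \in \mathbb{R}^N$, $i \in \{0, \dots, d\}$, is one feature across all training examples, with $q_0 = \mathbf{1}$ (the vector of all ones).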
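Then the in-sample error function can be re-written as,
\begin{align*}
E_{in}(w) &= \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n - y_n)^2 = \frac{1}{N} \left\| \begin{bmatrix} w^\top x_1 \\ \vdots \\ w^\top x_N \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \right\|_2^2 \\
&= \frac{1}{N} \left\| \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} w - \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \right\|_2^2 \\
\implies E_{in}(w) &= \frac{1}{N} \|Xw - y\|_2^2
\end{align*}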
Linear Regression Problem
Given a data set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$.
Find a weight vector $w^\star \in \mathbb{R}^{d+1}$ such that $w^\star = \arg\min_{w \in \mathbb{R}^{d+1}} E_{in}(w)$, where $E_{in}(w) = \frac{1}{N} \|Xw - y\|_2^2$.
Remark 1. The solution to the linear regression problem is referred to as the least-squares solution. Suppose that $X$ is square and invertible; then the solution to the above problem is simply $w^\star = X^{-1} y$. This solution yields an in-sample error $E_{in}(w^\star) = 0$. However, this would mean that $N = d + 1$ (the number of data points is equal to the dimension of our data vector plus one for the bias). In practice, this is almost never true.
Solution to the Linear Regression problem
Theorem 1. Suppose that the matrix $X^\top X$ is invertible. Then $w^\star = (X^\top X)^{-1} X^\top y$ is the solution to the linear regression problem.
Remark 2. We will offer the proof by using linear algebra. The proof using differentiation is in the lecture notes.
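Proof. To begin, we need to define four fundamental objects in linear algebra:
Column space of $X$: $C(X) = \{y \in \mathbb{R}^N \mid y = Xw = \sum_{i=0}^{d} w_i q_i,\ w_i \in \mathbb{R},\ q_i \in \mathbb{R}^N\}$. This is the set of all linear combinations of columns of $X$.
Orthogonal complement of the column space $C(X)$: $C(X)^\perp = \{e \in \mathbb{R}^N \mid e^\top y = 0,\ \forall y \in C(X)\} \subseteq \mathbb{R}^N$
Null space of $X$: $N(X) = \{w \in \mathbb{R}^{d+1} \mid Xw = 0,\ 0 \in \mathbb{R}^N\} \subseteq \mathbb{R}^{d+1}$
Left null space of $X$: $N(X^\top) = \{e \in \mathbb{R}^N \mid X^\top e = 0,\ 0 \in \mathbb{R}^{d+1}\} \subseteq \mathbb{R}^N$.
We claim that $C(X)^\perp = N(X^\top)$.
($\subseteq$) First, we show that $N(X^\top) \subseteq C(X)^\perp$. Let $e \in N(X^\top)$; then $X^\top e = 0$ and hence $e^\top X = 0^\top$. Multiplying both sides by any $w$ yields $e^\top X w = 0^\top w = 0$, that is, $e^\top y = 0$ for every $y = Xw \in C(X)$, so $e \in C(X)^\perp$.
($\supseteq$) Next, we show that $N(X^\top) \supseteq C(X)^\perp$. Let $e \in C(X)^\perp$. Then $e^\top y = 0$ for all $y \in C(X)$, so $e^\top (Xw) = 0 \implies (X^\top e)^\top w = 0$ for all $w \in \mathbb{R}^{d+1}$, which forces $X^\top e = 0$. Hence $e \in N(X^\top)$, so $N(X^\top) \supseteq C(X)^\perp$.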
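Next, we define the projection of $y \in \mathbb{R}^N$ onto $C(X)$ as,
\[ y_{ls} = Xw^\star,\ w^\star \in \mathbb{R}^{d+1}, \text{ such that } (y_{ls} - y)^\top y_w = 0,\ \forall y_w \in C(X). \tag{9} \]
Since $(y_{ls} - y)^\top y_w = 0$ for all $y_w \in C(X)$, we have $y_{ls} - y \in C(X)^\perp$. But recall that $C(X)^\perp = N(X^\top)$, therefore $y_{ls} - y \in N(X^\top)$. This means $X^\top (y_{ls} - y) = 0$. Substituting in the definition of $y_{ls}$, we have $X^\top (Xw^\star) = X^\top y$, therefore,
\[ w^\star = (X^\top X)^{-1} X^\top y \tag{10} \]
whenever $X^\top X$ is invertible, and the orthogonal projection is $y_{ls} = X (X^\top X)^{-1} X^\top y$.
Remark 3. Note that $X (X^\top X)^{-1} X^\top \neq I$ in general. A possible wrong proof is as follows: $X (X^\top X)^{-1} X^\top = X (X^{-1} (X^\top)^{-1}) X^\top = I$. However, the error is that $X$ is in general not square, hence not invertible.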
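The statements above can be checked numerically. The following sketch uses randomly generated data (an assumption of this note) and solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, which is a substitution made here for numerical reasons. It then verifies that the residual lies in the left null space of $X$ and that the projection matrix is not the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2
# Hypothetical data: leading column of ones, then d random features
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# Normal equations: solve (X^T X) w = X^T y instead of forming the inverse
w_star = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space
y_ls = P @ y
residual = y_ls - y

print(np.allclose(w_star, w_lstsq))      # do both routes agree?
print(np.allclose(X.T @ residual, 0))    # is the residual in N(X^T)?
print(np.allclose(P, np.eye(N)))         # is P the identity?
```

Because $N$ is much larger than $d + 1$ here, the projection matrix has rank $d + 1$ and cannot equal the $N \times N$ identity.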
Remark 4. The symmetric matrix $X^\top X$ is always positive semidefinite for any $X$: this means $z^\top X^\top X z \geq 0$ for all $z \in \mathbb{R}^{d+1}$. The proof is very straightforward: $z^\top X^\top X z = (Xz)^\top (Xz) = \|Xz\|_2^2 \geq 0$. A fact about symmetric positive semidefinite matrices is that all eigenvalues are nonnegative. If such a matrix is not positive definite, then it has at least one eigenvalue at zero, with the rest being positive.
In that case, the matrix is not invertible. This is because the determinant is the product of the eigenvalues of a matrix, which is then zero. Think back to the inverse formula of 2 × 2 matrices: this causes a division by 0 problem.
To ensure that $X^\top X$ is invertible, we assume that it is positive definite: this means $z^\top X^\top X z > 0$ for all $z \in \mathbb{R}^{d+1} \setminus \{0\}$. A symmetric positive definite matrix has all positive eigenvalues.
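To make the distinction concrete, the sketch below compares the eigenvalues of $X^\top X$ for two small hypothetical data matrices: one with linearly independent columns and one whose second column is a multiple of the first.

```python
import numpy as np

# Full column rank: X^T X is positive definite
X_full = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
# Second column is twice the first: X^T X is only positive semidefinite
X_deficient = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eig_full = np.linalg.eigvalsh(X_full.T @ X_full)
eig_def = np.linalg.eigvalsh(X_deficient.T @ X_deficient)
print(eig_full, eig_def)
```

In the second case the smallest eigenvalue is zero up to rounding, so $X^\top X$ is singular and the formula $(X^\top X)^{-1} X^\top y$ cannot be applied.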
Analysis of the In-Sample Error
Theorem 2. Assuming $X^\top X$ is positive definite (hence invertible), the in-sample error (ignoring the coefficient $\frac{1}{N}$) can be decomposed as
\[ E_{in}(w) = \|y_w - y_{ls}\|_2^2 + \|y - y_{ls}\|_2^2 \]
where $y_w = Xw$, $y_{ls} = X (X^\top X)^{-1} X^\top y = Xw^\star$, $w^\star = (X^\top X)^{-1} X^\top y$. At the optimal weight $w^\star$, the in-sample error is given by,
\[ E_{in}(w^\star) = y^\top (I - X (X^\top X)^{-1} X^\top) y \tag{13} \]
Proof. First, we show that $E_{in}(w)$ can be decomposed into two parts. Observe that the two vectors $y_w - y_{ls}$ and $y - y_{ls}$ are orthogonal to each other (written as $y_w - y_{ls} \perp y - y_{ls}$), as one lies in the column space of $X$ and the other lies in the orthogonal complement of the column space. For the vector lying in the column space of $X$, we have $y_w - y_{ls} = X(w - w^\star)$.
Recall that the level set of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the set
\[ L_c(f) = \{x \in \mathbb{R}^n \mid f(x) = c\}, \quad c \in \mathbb{R} \tag{18} \]
In this section, we briefly examine the level set of the in-sample error, and show that the level set is characterized by the matrix $X^\top X$.
This set is most easily visualized in 2D and lower. Hence we assume that $w = \begin{bmatrix} w_1 & w_2 \end{bmatrix}^\top \in \mathbb{R}^2$. Similarly, $x = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^\top \in \mathbb{R}^2$ (ignoring the bias term). We assume that the matrix $X^\top X$ is positive definite (all eigenvalues are positive).
Figure 1: A possible graph of the in-sample error in 2D. The level sets are ellipses as clearly shown. Why is this the case, how do these ellipses tilt, and where is the center of the ellipse?
From the previous section, the in-sample error can be written as,
\[ E_{in}(w) = (w - (X^\top X)^{-1} X^\top y)^\top X^\top X (w - (X^\top X)^{-1} X^\top y) + \underbrace{y^\top (I - X (X^\top X)^{-1} X^\top) y}_{\text{constant}} \tag{19} \]
Since the constant term only shifts the graph up or down, we can safely ignore it in our analysis of the level set. Hence we obtain,
\[ E_{in}(w) = (w - (X^\top X)^{-1} X^\top y)^\top X^\top X (w - (X^\top X)^{-1} X^\top y) = c \tag{20} \]
where we assume that our constant $c$ is equal to one, i.e., $c = 1$.
Let $\Delta w = w - (X^\top X)^{-1} X^\top y = w - w^\star$, then,
\[ E_{in}(w) = (\Delta w)^\top X^\top X (\Delta w) = 1 \tag{21} \]
We proceed with simplifying the above expression further using the following well-known theorem.
Spectral Theorem
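Theorem 3. Every real symmetric matrix $A \in \mathbb{R}^{n \times n}$ has the factorization $A = Q \Lambda Q^\top$, where
\[ \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n) = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \]
is a diagonal matrix, $\lambda_i$ is the $i$th real eigenvalue of $A$, and
\[ Q = \begin{bmatrix} v_1 \mid \cdots \mid v_n \end{bmatrix} \]
is an orthogonal matrix ($Q^\top = Q^{-1}$) and $v_i$ is an orthonormal (i.e., $v_i^\top v_j = 0$ if $i \neq j$, and $v_i^\top v_i = 1$) eigenvector associated with $\lambda_i$ (i.e., $A v_i = \lambda_i v_i$).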
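Since our matrix $X^\top X$ is symmetric, by the spectral theorem we can write $X^\top X = Q \Lambda Q^\top$, where $\Lambda$ and $Q$ are defined above. In this case, $\Lambda, Q \in \mathbb{R}^{2 \times 2}$. Hence,
\[ E_{in}(w) = (\Delta w)^\top X^\top X (\Delta w) = (Q^\top \Delta w)^\top \Lambda\, Q^\top \Delta w = 1 \tag{22} \]
Let $\Delta z = Q^\top \Delta w$, then,
\begin{align*}
E_{in}(w) &= (\Delta w)^\top X^\top X (\Delta w) \\
&= (Q^\top \Delta w)^\top \Lambda\, Q^\top \Delta w \\
&= (\Delta z)^\top \Lambda (\Delta z) \\
&= \begin{bmatrix} \Delta z_1 & \Delta z_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \Delta z_1 \\ \Delta z_2 \end{bmatrix} \\
&= \lambda_1 \Delta z_1^2 + \lambda_2 \Delta z_2^2 \\
&= \frac{\Delta z_1^2}{(1/\sqrt{\lambda_1})^2} + \frac{\Delta z_2^2}{(1/\sqrt{\lambda_2})^2} = 1 \tag{23}
\end{align*}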
This is precisely the equation of an axis-aligned ellipse, centered at the origin, in the $\Delta z$-coordinates. If we assume that $\lambda_2 > \lambda_1$, then the ellipse has a semi-major axis of length $\frac{1}{\sqrt{\lambda_1}}$ and a semi-minor axis of length $\frac{1}{\sqrt{\lambda_2}}$. It is a circle whenever $\lambda_1 = \lambda_2$. However, this is the equation in terms of the $\Delta z$-coordinates. We wish to know what the level set looks like in the $w$-coordinates.
Transforming back involves the following two steps,
1. First, recall that $\Delta z = Q^\top \Delta w$. Using the fact that $Q^\top = Q^{-1}$, we obtain $Q \Delta z = \Delta w$. This means, given the unit vector $\begin{bmatrix} 1 & 0 \end{bmatrix}^\top$ in the $\Delta z$-coordinates, we obtain $Q \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} v_1 & v_2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = v_1 = \Delta w$. This implies that the corresponding unit vector in the $\Delta w$-coordinates is $v_1$, the first eigenvector of $X^\top X$. The other unit axis maps to the other eigenvector, $v_2$.
2. Now that we have the representation of the ellipse in the $\Delta w$-coordinates, to transform back into the $w$-coordinates, simply note that $\Delta w = w - w^\star$. Consider the origin $\begin{bmatrix} 0 & 0 \end{bmatrix}^\top$ in the $\Delta w$-coordinates. This is exactly equivalent to $w^\star$ in the $w$-coordinates.
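The two steps above can be traced numerically. In this sketch the matrix A stands in for $X^\top X$ and w_star for the least-squares solution; both are hypothetical values chosen for illustration. Points are generated on the axis-aligned ellipse in the $\Delta z$-coordinates, rotated by $Q$, shifted by $w^\star$, and then checked against the level-set equation.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # hypothetical positive definite X^T X
w_star = np.array([1.0, -1.0])             # hypothetical least-squares solution

lam, Q = np.linalg.eigh(A)                 # A = Q diag(lam) Q^T, eigenvalues ascending
semi_axes = 1.0 / np.sqrt(lam)             # semi-axis lengths in the dz-coordinates

# Points on the axis-aligned ellipse in dz-coordinates
theta = np.linspace(0.0, 2.0 * np.pi, 200)
dz = np.vstack([semi_axes[0] * np.cos(theta), semi_axes[1] * np.sin(theta)])

dw = Q @ dz                                # rotate into the dw-coordinates
w = dw + w_star[:, None]                   # shift so the center is w_star

# Evaluate (w - w_star)^T A (w - w_star) at every point
level = np.einsum("in,ij,jn->n", dw, A, dw)
print(semi_axes, level.min(), level.max())
```

Every point should give a level value of 1 up to rounding, and the rotation by $Q$ leaves the length of each point unchanged.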
The following figure represents the series of transformations.
Figure 2: Far left: the original ellipse in the $\Delta z$-coordinates (axes $\Delta z_1$, $\Delta z_2$). The lengths of the semi-major and semi-minor axes, $\frac{1}{\sqrt{\lambda_1}}$ and $\frac{1}{\sqrt{\lambda_2}}$, are shown (assuming $\lambda_2 > \lambda_1$). Middle: the ellipse in the $\Delta w$-coordinates (axes $\Delta w_1$, $\Delta w_2$); the eigenvectors $v_1$, $v_2$ of $X^\top X$ specify the direction (tilt) of the ellipse. Far right: the ellipse in the $w$-coordinates (axes $w_1$, $w_2$); all vectors are shifted by $w^\star$.
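Remark 5. It is important to note that such a transformation from the $\Delta z$-coordinates to the $\Delta w$-coordinates will not distort the ellipse, meaning that the lengths of the major and minor axes remain the same. To see this, simply consider the point $\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} & 0 \end{bmatrix}^\top$. A multiplication by $Q$ yields $Q \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} \\ 0 \end{bmatrix} = \frac{1}{\sqrt{\lambda_1}} v_1$. But this new vector has the exact same length as $\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} & 0 \end{bmatrix}^\top$. To see this, simply note that $\left\| \frac{1}{\sqrt{\lambda_1}} v_1 \right\| = \frac{1}{\sqrt{\lambda_1}} \|v_1\| = \frac{1}{\sqrt{\lambda_1}}$, since $v_1^\top v_1 = \|v_1\|_2^2 = 1 \implies \|v_1\| = 1$. (We made use of the orthonormality of $v_1$.)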
A Related Problem: Polynomial Curve Fitting
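We show that polynomial curve fitting is a generalization of linear regression for 1D data.
Given $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$, $x_n \in \mathbb{R}$, $y_n \in \mathbb{R}$, we wish to find a polynomial
$$h(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M,$$
with $M \ge 0$ denoting the order of the polynomial and $w = \begin{bmatrix} w_0 & w_1 & \dots & w_M \end{bmatrix}^\top \in \mathbb{R}^{M+1}$, such that,
$$w^\star = \arg\min_{w \in \mathbb{R}^{M+1}} E_{in}(w) \tag{24}$$
where $E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} (h(x_n) - y_n)^2$.
This problem almost looks like a linear regression problem. Recall that for linear regression, $h(x_n) = w^\top x_n = w^\top \psi(x_n)$, where $\psi$ is the identity function, $\psi : \mathbb{R}^{d+1} \to \mathbb{R}^{d+1}$, $x \mapsto x$.
Following this idea, we define the feature map $\psi : \mathbb{R} \to \mathbb{R}^{M+1}$,
$$\psi(x) = \begin{bmatrix} 1 & x & x^2 & \dots & x^M \end{bmatrix}^\top \tag{25}$$
then $h(x) = w^\top \psi(x) = \sum_{i=0}^{M} w_i \psi_i(x)$.
Therefore,
$$E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} (h(x_n) - y_n)^2 = \frac{1}{N}\sum_{n=1}^{N} (w^\top \psi(x_n) - y_n)^2 = \frac{1}{N}\sum_{n=1}^{N} \Big(\sum_{i=0}^{M} w_i \psi_i(x_n) - y_n\Big)^2 = \frac{1}{N}\|\Psi w - y\|_2^2,$$
where,
$$\Psi = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \dots & x_N^M \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
Then the least squares solution corresponding to this problem can be written as,
$$w = (\Psi^\top \Psi)^{-1} \Psi^\top y$$
and when $M + 1 = N$ (the number of features is equal to the number of data points) and $\Psi$ is invertible, we obtain,
$$w = (\Psi^\top \Psi)^{-1} \Psi^\top y = \Psi^{-1} (\Psi^\top)^{-1} \Psi^\top y = \Psi^{-1} y$$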
% The following MATLAB code generates 10 random integer data points in the range 1 to 20,
% then fits a 4th order polynomial to them.

N = 10;
x = linspace(1, N, N);                % inputs 1, 2, ..., N
y = round(1 + (20-1).*rand(N,1));     % random integer targets between 1 and 20

M = 4;                                % order of the polynomial
P = zeros(N, M+1);                    % feature matrix (Psi in the text)

for n = 1:1:N
    for m = 1:1:M+1
        P(n, m) = x(n)^(m-1);
    end
end

w = (P.'*P) \ (P.'*y);                % least squares solution, same as inv(P.'*P)*P.'*y
plot(x, y, 'ro')

t = linspace(min(x), max(x), 1000);

% Evaluate the fitted polynomial on the grid t, one term at a time
L = 0;
for i = 1:1:M+1
    L_i = w(i).*t.^(i - 1);
    L = L + L_i;
end
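In logistic regression, we are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, $x_n \in \mathcal{X} = \mathbb{R}^d$, $d \ge 1$, and $y_n \in \mathcal{Y} = \{-1, +1\}$. Even though it is called regression, the purpose of logistic regression is to classify the data into the two labels given above. Therefore officially, we would like to learn a hypothesis from the following class:
$$\mathcal{H} = \left\{ h : \{1\} \times \mathcal{X} \to \mathcal{Y},\ x \mapsto y \ \middle|\ h(x) = \mathrm{sgn}(\theta(w^\top x) - 0.5),\ w \in \mathbb{R}^{d+1},\ x \in \mathbb{R}^{d+1} \right\}$$
where $\theta$ is the logistic function, $\theta : \mathbb{R} \to (0, 1) = \mathrm{int}([0, 1])$, $z \mapsto \theta(z) = \dfrac{\exp(z)}{1 + \exp(z)} = \dfrac{1}{1 + \exp(-z)}$.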
We briefly list some properties of the logistic function:
• $\lim_{z \to \infty} \theta(z) = 1$, $\lim_{z \to -\infty} \theta(z) = 0$
• $\theta(0) = 0.5$ (Interpretation: if data falls on the hyperplane, then $w^\top x = 0$, and $\theta(0) = 0.5$ simply means that the classifier is not sure which class the data belongs to.)
• $1 - \theta(z) = \theta(-z)$
• $\theta(z) = \dfrac{d \log(1 + \exp(z))}{dz} = \nabla \log(1 + \exp(z))$
• $\nabla \theta(z) = \theta(z)(1 - \theta(z)) = \dfrac{d^2 \log(1 + \exp(z))}{dz^2} = \nabla^2 \log(1 + \exp(z))$
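The properties listed above are easy to check numerically. The following is a minimal sketch using only the Python standard library; the helper names `theta`, `softplus` and `central_diff` are ours, and the derivatives are approximated with central differences rather than computed exactly.

```python
import math

def theta(z):
    """Logistic function, written so that exp never overflows for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softplus(z):
    """log(1 + exp(z)), the function whose derivative is theta."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def central_diff(f, z, h=1e-5):
    """Central-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # 1 - theta(z) = theta(-z)
    assert abs((1.0 - theta(z)) - theta(-z)) < 1e-12
    # theta(z) = d/dz log(1 + exp(z))
    assert abs(central_diff(softplus, z) - theta(z)) < 1e-8
    # theta'(z) = theta(z) (1 - theta(z))
    assert abs(central_diff(theta, z) - theta(z) * (1.0 - theta(z))) < 1e-8
```

Splitting `theta` into two branches is a design choice for numerical stability only; mathematically both branches are the same function.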
Alternatively, we can treat our hypothesis simply as $h(x) = \theta(w^\top x) \in (0, 1)$ (e.g., Shalev-Shwartz and Ben-David's Understanding Machine Learning book, page 98). However, notice that this hypothesis function does not map to the target space $\mathcal{Y} = \{-1, 1\}$. So the output $\hat{y} = h(x)$ is not a prediction of $y \in \{-1, 1\}$, but rather a probability that implicitly predicts $y$. We will assume this hypothesis for the rest of this section.
Assume that probability of y given x is given by,
Pr[y = +1|x] = h(x)
Pr[y = −1|x] = 1− h(x)
where $\Pr$ denotes the probability measure and $x$, $y$ are random variables. This generates a conditional probability mass function,
$$P_w(y|x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases} = \theta(y\, w^\top x) \tag{1}$$
We wish to maximize the joint probability of obtaining $y_1, \dots, y_N$ given $x_1, \dots, x_N$.
Derivation of the In-Sample Error and Its Gradient
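The optimal weight vector $w^\star$ is the solution to the maximum likelihood problem,
$$\begin{aligned}
w^\star &= \arg\max_{w \in \mathbb{R}^{d+1}} P_w(y_1, \dots, y_N \mid x_1, \dots, x_N) \\
&= \arg\max_{w \in \mathbb{R}^{d+1}} \prod_{n=1}^{N} P_w(y_n|x_n) \quad \text{(i.i.d. assumption)}
\end{aligned}$$
In general, this problem is difficult to solve. Instead, we solve the related maximum log-likelihood problem, which yields the same optimizer. Define the log-likelihood as $\log\big(\prod_{n=1}^{N} P_w(y_n|x_n)\big) = \sum_{n=1}^{N} \log(P_w(y_n|x_n))$, which we can simplify to,
$$\begin{aligned}
\log\Big(\prod_{n=1}^{N} P_w(y_n|x_n)\Big) &= \sum_{n=1}^{N} \log(P_w(y_n|x_n)) \\
&= -\sum_{n=1}^{N} \log\left(\frac{1}{P_w(y_n|x_n)}\right) \\
&= -\sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))
\end{aligned}$$
Then equivalently,
$$\begin{aligned}
w^\star &= \arg\max_{w \in \mathbb{R}^{d+1}} \prod_{n=1}^{N} P_w(y_n|x_n) \\
&= \arg\max_{w \in \mathbb{R}^{d+1}} \log\Big(\prod_{n=1}^{N} P_w(y_n|x_n)\Big) \\
&= \arg\max_{w \in \mathbb{R}^{d+1}} -\sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n)) \\
&= \arg\min_{w \in \mathbb{R}^{d+1}} \sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))
\end{aligned}$$
Then we define $E_{in}(w) = \dfrac{1}{N} \sum_{n=1}^{N} \log(1 + \exp(-y_n w^\top x_n))$.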
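Remark 1. (In-Sample Error in Entropy Form) Observe that $E_{in}(w) = \dfrac{1}{N} \sum_{n=1}^{N} \log\left(\dfrac{1}{P_w(y_n|x_n)}\right)$ and $P_w(y|x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases}$. Hence, each term of the in-sample error can be written as,
$$\log\left(\frac{1}{P_w(y_n|x_n)}\right) = \begin{cases} \log\left(\dfrac{1}{h(x_n)}\right) & y_n = +1 \\[2ex] \log\left(\dfrac{1}{1 - h(x_n)}\right) & y_n = -1 \end{cases}$$
or all in one line,
$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} [\![ y_n = +1 ]\!] \log\left(\frac{1}{h(x_n)}\right) + [\![ y_n = -1 ]\!] \log\left(\frac{1}{1 - h(x_n)}\right),$$
This is the entropy representation of the in-sample error.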
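In order to use a first-order (i.e., gradient-based) method to find the optimal weight of the model, we need to calculate the gradient of the in-sample error. This follows directly from the chain rule.
$$\begin{aligned}
\nabla E_{in}(w) &= \frac{1}{N} \sum_{n=1}^{N} \frac{1}{1 + \exp(-y_n w^\top x_n)} \exp(-y_n w^\top x_n)(-y_n x_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} \frac{\exp(-y_n w^\top x_n)}{1 + \exp(-y_n w^\top x_n)} (-y_n x_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} -y_n x_n\, \theta(-y_n w^\top x_n)
\end{aligned}$$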
Remark 2. Observe that when a piece of data is misclassified (by definition, $y_n w^\top x_n < 0$), the argument of $\theta$ is positive, therefore the output of $\theta$ is closer to 1. Otherwise, if it was correctly classified, the output of $\theta$ is closer to 0. This means that a method such as gradient descent or stochastic gradient descent updates the model more aggressively upon encountering a misclassified point, and keeps the model relatively unchanged when the point is correctly classified.
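To make the gradient formula concrete, here is a minimal sketch of batch gradient descent on $E_{in}$ in plain Python. The function names, the toy data set, the learning rate `eta` and the stopping threshold `eps` are illustrative assumptions, not values from the course; each $x_n$ carries a leading 1 so that $w \in \mathbb{R}^{d+1}$.

```python
import math

def theta(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ein(w, X, y):
    """In-sample error (1/N) sum_n log(1 + exp(-y_n w^T x_n))."""
    return sum(math.log1p(math.exp(-yn * dot(w, xn))) for xn, yn in zip(X, y)) / len(X)

def grad_ein(w, X, y):
    """Gradient (1/N) sum_n -y_n x_n theta(-y_n w^T x_n)."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        s = theta(-yn * dot(w, xn))   # close to 1 for misclassified points
        for j, xj in enumerate(xn):
            g[j] += -yn * xj * s / len(X)
    return g

def gradient_descent(X, y, eta=0.5, eps=1e-6, max_iter=10000):
    """Iterate w_{k+1} = w_k - eta * grad until ||w_{k+1} - w_k||_2 <= eps."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        e = grad_ein(w, X, y)
        w_next = [wj - eta * ej for wj, ej in zip(w, e)]
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w)))
        w = w_next
        if step <= eps:
            break
    return w

# Hypothetical 1D toy data (not linearly separable, so the minimizer is finite).
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.5], [1.0, -0.5]]
y = [-1, -1, +1, +1, -1, +1]
w_star = gradient_descent(X, y)
print(w_star, ein(w_star, X, y))
```

The stopping rule based on the step length is one possible decision rule; monitoring the gradient norm or the change in $E_{in}$ would work equally well.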
To Summarize
The logistic regression model is shown in Figure 1. The encircled part on top is the training phase of the logistic regression model. The gradient is fed into the gradient descent optimizer, and a decision rule is applied to determine whether to continue to train. After the training has stopped (the weights are no longer moving), we build our hypothesis $h(x) = \theta(w^{\star\top} x)$, which outputs a probability. We may threshold the probability using the sign function to determine the label.
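[Figure 1 (block diagram). Training: the data $x_1, \dots, x_N$, $y_1, \dots, y_N$ are used to compute the gradient $e_k = \nabla E_{in}(w_k)$, where $\nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} -y_n x_n \theta(-y_n w^\top x_n)$; the optimizer updates $w_{k+1} = w_k - \eta e_k$; the decision rule checks $\|w_{k+1} - w_k\|_2 \le \epsilon$ (NO: set $w = w_{k+1}$ and continue; YES: stop with $w^\star = w_{k+1}$). Testing: given $x$, output $h(x) = \theta(w^{\star\top} x) = \Pr[y = +1|x] \in (0, 1)$.]
Figure 1: Full block diagram representation of logistic regression. $\epsilon$ is the threshold for stopping the training, and $\eta > 0$ is a learning rate. All supervised learning can be represented using a block diagram such as the one above (ignoring validation).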
This is in fact a model for a very basic neural network (as shown in tutorial). However, there are a couple of difficulties in extending this to a multi-layer neural network model. One is that the gradient is usually very difficult to compute. The gradient block at the very top is replaced with the so-called backpropagation block. Another issue is that most of the data are not streamed all at once, but one at a time (or in a mini-batch). If the data are streamed one at a time, then we change our optimizer to the stochastic gradient optimizer. If the data are streamed in mini-batches, we use mini-batch gradient descent.
Softmax Regression and some problems that were left out
Bolin Gao
Oct 10, 2019
Softmax Regression
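Softmax regression (or multi-class logistic regression) is a generalization of logistic regression. We are given a data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathcal{X} = \mathbb{R}^d$, $d \ge 1$, and $y_n \in \mathcal{Y} = \{1, \dots, c\}$, $c \ge 2$, where $c$ is the number of classes that the data belong to.
Officially, we seek a model from the following class of hypotheses,
$$\mathcal{H} = \left\{ h : \{1\} \times \mathcal{X} \to \mathcal{Y},\ x \mapsto y \ \middle|\ h(x) = \arg\max_{i \in \{1, \dots, c\}} e_i^\top \sigma(W^\top x),\ W = \begin{bmatrix} w^{(1)} & \dots & w^{(c)} \end{bmatrix},\ w^{(i)} \in \mathbb{R}^{d+1},\ W \in \mathbb{R}^{(d+1) \times c} \right\} \tag{1}$$
where $\sigma$ is the softmax function,
$$\sigma(W^\top x) = \left[ \frac{\exp(w^{(i)\top} x)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)} \right]_{i=1}^{c} = \frac{1}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)} \begin{bmatrix} \exp(w^{(1)\top} x) \\ \vdots \\ \exp(w^{(c)\top} x) \end{bmatrix}. \tag{2}$$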
The hypothesis $h : \{1\} \times \mathcal{X} \to \mathcal{Y}$ is constructed as follows:
1. First, we multiply the weight matrix $W$ with an example $x$. Since $W$ is a matrix, this yields a vector $W^\top x = \begin{bmatrix} w^{(1)\top} x & \dots & w^{(c)\top} x \end{bmatrix}^\top \in \mathbb{R}^c$.
2. Next, we feed this vector $W^\top x$ into the softmax function $\sigma$, to form a probability vector $\sigma(W^\top x) \in \mathrm{int}(\Delta^c) = \{ v \in \mathbb{R}^c \mid v_1 + v_2 + \dots + v_c = 1,\ v_i > 0,\ \forall i \in \{1, \dots, c\} \}$ (this is the interior of the simplex in $\mathbb{R}^c$).
3. We form the inner product between this probability vector $\sigma(W^\top x)$ and a basis vector $e_i \in \mathbb{R}^c$, where $e_i$ has 0 for all entries except a 1 at the $i$th entry.
4. Finally, we choose the value of $i$ that maximizes the inner product $e_i^\top \sigma(W^\top x)$. This optimal $i$ is the index associated with the largest entry of the probability vector $\sigma(W^\top x)$. Since $i \in \{1, \dots, c\}$, it is a prediction of $y$.
Alternatively, we can treat our hypothesis simply as $h(x) = \sigma(W^\top x)$. However, notice that this hypothesis function does not map to the target space $\mathcal{Y} = \{1, \dots, c\}$. So the output $\hat{y} = h(x)$ is not a prediction of $y \in \{1, \dots, c\}$. Despite this issue, we will assume this hypothesis for the rest of this section.
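The four construction steps above can be sketched in a few lines of Python. This is an illustration only: the weight matrix `W` and the example `x` are made-up numbers, `W` is stored as a list of $d+1$ rows with $c$ entries each, and subtracting the maximum score inside `softmax` is a standard stability trick that does not change the result.

```python
import math

def softmax(v):
    """Softmax of a list of scores; shifting by max(v) avoids overflow."""
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, x):
    """Step 1: compute W^T x, where W has shape (d+1) x c."""
    c = len(W[0])
    return [sum(W[k][i] * x[k] for k in range(len(x))) for i in range(c)]

def predict(W, x):
    """Steps 2-4: probability vector, then the index of its largest entry."""
    p = softmax(scores(W, x))
    i = max(range(len(p)), key=lambda k: p[k])
    return i + 1, p          # classes are numbered 1..c

# Hypothetical weights with d + 1 = 3 and c = 3.
W = [[0.0, 0.5, -0.5],
     [1.0, -1.0, 0.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, -1.0]         # leading 1, then the d = 2 features
label, p = predict(W, x)
print(label, p)
```

Here $W^\top x = (2, -2.5, -2.5)$, so the first class receives the largest probability.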
Derivation of the Sample Error and Its Gradient
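Assume the conditional probability of predicting $y = i$ given $x$ is given by,
$$\Pr[y = i|x] = h_i(x) = \frac{\exp(w^{(i)\top} x)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x)}, \quad i \in \{1, \dots, c\}$$
This gives rise to a conditional PMF parameterized by the weights $W$,
$$P_W(y|x) = \begin{cases} h_1(x) & y = 1 \\ \ \vdots & \ \vdots \\ h_c(x) & y = c \end{cases} = \sigma_y(W^\top x) = e_y^\top \sigma(W^\top x) = e_y^\top \sigma\left( \begin{bmatrix} w^{(1)\top} x \\ \vdots \\ w^{(c)\top} x \end{bmatrix} \right)$$
where $e_y = \begin{bmatrix} 0 & \dots & 1 & \dots & 0 \end{bmatrix}^\top$, and 1 occupies the $y$th position (this is also referred to as the one-hot encoding of $y$).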
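The optimal weight matrix $W^\star = \begin{bmatrix} w^{(1)\star} & \dots & w^{(c)\star} \end{bmatrix}$ is the solution to the maximum likelihood problem,
$$\begin{aligned}
W^\star &= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \prod_{n=1}^{N} P_W(y_n|x_n) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \log\Big(\prod_{n=1}^{N} P_W(y_n|x_n)\Big) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \sum_{n=1}^{N} \log(P_W(y_n|x_n)) \\
&= \arg\max_{W \in \mathbb{R}^{(d+1) \times c}} \sum_{n=1}^{N} \log\left( \frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} \right) \\
&= \arg\min_{W \in \mathbb{R}^{(d+1) \times c}} -\sum_{n=1}^{N} \log\left( \frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} \right)
\end{aligned}$$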
Remark 1. The formulation of the likelihood is slightly different than the one presented in the tutorial, which was somewhat based on Chris Bishop's formulation of the softmax regression problem (Page 209, Pattern Recognition and Machine Learning). Please make appropriate changes in your notes. Thanks to Arnav Goel for alerting me to this problem. In any case, for us the summation term is not important; we will only consider the sample error (the things within the sum).
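We then define the in-sample error as
$$E_{in}(W) = \frac{1}{N} \sum_{n=1}^{N} e_n(W) \tag{3}$$
where $e_n(W)$ is the (per-)sample error, given by,
$$e_n(W) = -\log\left( \frac{\exp(w^{(y_n)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} \right) \tag{4}$$
which we can rearrange to be,
$$e_n(W) = -\log(\exp(w^{(y_n)\top} x_n)) + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) = -w^{(y_n)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big)$$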
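Next, we wish to compute the gradient of the sample error with respect to some weight vector $w^{(i)} \in \mathbb{R}^{d+1}$. In order to do so, we need to consider two cases,
$$e_n(W) = -w^{(y_n)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) = \begin{cases} -w^{(i)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) & y_n = i \\[1ex] -w^{(l)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) & y_n = l \ne i \end{cases}$$
Then it is clear,
$$\nabla_{w^{(i)}} e_n(W) = \begin{cases} \nabla_{w^{(i)}} \Big[ -w^{(i)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) \Big] & y_n = i \\[1ex] \nabla_{w^{(i)}} \Big[ -w^{(l)\top} x_n + \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) \Big] & y_n = l \ne i \end{cases} = \begin{cases} -x_n + \dfrac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} x_n & y_n = i \\[2ex] \dfrac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} x_n & y_n = l \ne i \end{cases}$$
where we have used the fact,
$$\nabla_{w^{(i)}} \Big[ \log\Big(\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)\Big) \Big] = \frac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} x_n = \sigma_i(W^\top x_n)\, x_n$$
We can express everything in a more succinct way,
$$\nabla_{w^{(i)}} e_n(W) = -x_n [\![ y_n = i ]\!] + \frac{\exp(w^{(i)\top} x_n)}{\sum_{j=1}^{c} \exp(w^{(j)\top} x_n)} x_n.$$
One may also wish to combine the above two terms into a single term, $\nabla_{w^{(i)}} e_n(W) = \big( \sigma_i(W^\top x_n) - [\![ y_n = i ]\!] \big)\, x_n$.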
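As a sanity check on the derivation, the sketch below implements the per-sample error $e_n(W)$ and the single-term gradient $(\sigma_i(W^\top x_n) - [\![ y_n = i ]\!])\, x_n$ in plain Python. The storage convention (a list `W` of the $c$ weight vectors $w^{(1)}, \dots, w^{(c)}$), the function names and the numbers are our own assumptions for illustration.

```python
import math

def class_scores(W, x):
    """Scores w^(j)T x for j = 1..c; W is a list of c weight vectors."""
    return [sum(wk * xk for wk, xk in zip(w, x)) for w in W]

def softmax(v):
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]

def sample_error(W, x, y):
    """e_n(W) = -w^(y)T x + log sum_j exp(w^(j)T x), with y in 1..c."""
    s = class_scores(W, x)
    m = max(s)
    return -s[y - 1] + m + math.log(sum(math.exp(si - m) for si in s))

def grad_wi(W, x, y, i):
    """Gradient of e_n with respect to w^(i): (sigma_i - [y == i]) * x."""
    p = softmax(class_scores(W, x))
    coeff = p[i - 1] - (1.0 if y == i else 0.0)
    return [coeff * xk for xk in x]

# Hypothetical example with c = 3 classes and d + 1 = 2.
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
x = [1.0, 2.0]
y = 2
for i in (1, 2, 3):
    print(i, grad_wi(W, x, y, i))
```

Because the softmax entries sum to 1, the gradients with respect to $w^{(1)}, \dots, w^{(c)}$ add up to the zero vector, which is a quick consistency check.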
Feature Transform
The idea of the feature transform is simple, and we have seen some of it when we discussed polynomial regression.
Suppose that we wish to perform binary classification using linear classifiers, however the data points in our data set are not linearly separable. How can this be done?
The intuitive approach is to rearrange the data in a way so that it becomes linearly separable. This operation is called the feature transform.
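Example 1. Suppose we have integer data $x_n \in \{-10, -9, -8, \dots, 8, 9, 10\}$ and $y_n = \begin{cases} -1 & |x_n| > 2 \\ +1 & \text{otherwise} \end{cases}$.
Clearly the data is not linearly separable.
But it becomes linearly separable when we apply the transform $\Phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}$. This function maps the space of our data $\mathcal{X} = \{-10, -9, -8, \dots, 8, 9, 10\}$ into a new space $\mathcal{Z}$.
One possible linear classifier in the new space is (plot it!),
$$h(x) = \mathrm{sgn}(w^\top \Phi(x)) = \mathrm{sgn}\left( \begin{bmatrix} 5 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix} \right) = \mathrm{sgn}(5 - x^2), \quad w = \begin{bmatrix} 5 & 0 & -1 \end{bmatrix}^\top$$
(The sign of $w$ matters: $5 - x^2$ is positive exactly when $|x| \le 2$, which matches the label $+1$.)
Then given a new data point $x$, we can simply output a prediction of the label using the new classifier $h(x) = \mathrm{sgn}(w^\top \Phi(x))$.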
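Example 1 can be verified directly. The sketch below applies the transform $\Phi(x) = (1, x, x^2)$ to every integer in $\{-10, \dots, 10\}$ and counts the misclassified points. It assumes the weight vector $w = (5, 0, -1)$, chosen so that $w^\top \Phi(x) = 5 - x^2$ is positive exactly for $|x| \le 2$ (the label $+1$ region); the helper names are ours.

```python
def phi(x):
    """Feature transform: x -> (1, x, x^2)."""
    return [1.0, float(x), float(x) ** 2]

def sgn(v):
    return 1 if v > 0 else -1

def h(x, w):
    """Linear classifier in the transformed space."""
    return sgn(sum(wi * fi for wi, fi in zip(w, phi(x))))

xs = list(range(-10, 11))
ys = [-1 if abs(x) > 2 else +1 for x in xs]

w = [5.0, 0.0, -1.0]          # assumed weights: w^T phi(x) = 5 - x^2
errors = sum(h(x, w) != y for x, y in zip(xs, ys))
print(errors)                 # 0
```

Flipping the sign of `w` flips every prediction, so the orientation of the hyperplane has to agree with the labeling convention.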
Following the approach in the textbook, we find such a transform in the following way,
1. Find a nonlinear boundary in the space of the data X that separates the data
2. Re-write the nonlinear boundary in the form of a hyperplane w>Φ(x)
3. Obtain the feature transform Φ(x)
Exercise 3.13
In this exercise, we wish to find $w$ to represent a list of boundaries.
Taking one of the examples, say (b),
The circle centered at $(3, 4)$ is given by $(x_1 - 3)^2 + (x_2 - 4)^2 = 1 \implies x_1^2 - 6x_1 + 9 + x_2^2 - 8x_2 + 16 = 1$. We then re-write this boundary as a hyperplane, $0 = x_1^2 - 6x_1 + 9 + x_2^2 - 8x_2 + 16 - 1 = w^\top \Phi(x)$.
In this case, $\Phi(x)$ is given as $\begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^\top$. So we obtain $w = \begin{bmatrix} 24 & -6 & -8 & 1 & 0 & 1 \end{bmatrix}^\top$.
The other cases are dealt with similarly.
Problem 3.17
(a)
Recall that given a function f : Rn → R and a vector p ∈ Rn, the inexact Taylor series expansion is,
For Midterm Questions: Friday 3 pm - 4 pm BA4162 see Sindhu Gowda
History: Kunihiko Fukushima proposed the first convolutional neural network in 1980. Popularized by Yann LeCun and Yoshua Bengio et al. in 1998 for document recognition.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used AlexNet to win the competition.
Almost halved the error achieved by top non-neural net based vision algorithms! (15.5% of the time the correct label was not one of the top 5 answers; compare with 5-10% humans)
This event revolutionized machine learning in many ways.
2012 ImageNet Challenge: best 5 out of 1,000 categories
1. Popularized the Rectified Linear Unit (ReLU), Dropout, and Convolutional Neural Networks
2. Beginning of “deep learning” hype
3. Popularized training with GPUs
4. Put UofT solidly on the map (in terms of machine learning and AI)
5. UofT ECE department started to consider teaching machine learning courses…the rest is history.
AlexNet is a Convolutional Neural Network
CNN for image classification (Base architecture)
Image → [1. Convolution → 2. Nonlinear mapping → 3. Pooling] (repeat steps 1 - 3) → Flatten → Fully connected (dense layer) → Softmax → Output probabilities
CNN for image classification
Same pipeline as above, with the training loop added: the output probabilities $\hat{y}$ and the (one-hot encoded) true label $y$ enter the loss function; compute the gradient of the loss function (backpropagation); Stochastic Gradient Descent (or variants) then produces the weight update $w_{k+1}$.
Convolution
Image 2x2 Filter
Slide a matrix (or filter, kernel) across the image, pairwise multiply each entry in the image with the corresponding value of the matrix, then sum all values. This creates a new matrix.
In practice, filter elements or "weights" are randomly initialized, e.g., sampled from a Gaussian.
Mathematically (for this simple example):
$$\mathrm{Output}(x, y) = \sum_{m} \sum_{n} \mathrm{Image}(x + m,\, y + n)\, \mathrm{Filter}(m, n)$$
where Image and Filter represent their respective matrices.
(Misnomer: this is not a convolution as in signal processing; it is more like cross-correlation.)
Image 2x2 Filter Output
Most crucial idea in CNN:
Each convolution implements a small neural network. Therefore filters are learned.
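The sliding-window operation described above can be written out directly. The following is a minimal sketch in plain Python of the "convolution" used in CNNs (really a cross-correlation, with stride 1 and no padding). The function name is ours, and the 3x3 image and 2x2 filter are the values from the worked slide example as we read them.

```python
def cross_correlate2d(image, filt):
    """Output(x, y) = sum_m sum_n Image(x + m, y + n) * Filter(m, n)."""
    H, W = len(image), len(image[0])
    fh, fw = len(filt), len(filt[0])
    out = []
    for x in range(H - fh + 1):          # top-left row of the window
        row = []
        for y in range(W - fw + 1):      # top-left column of the window
            row.append(sum(image[x + m][y + n] * filt[m][n]
                           for m in range(fh) for n in range(fw)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filt = [[1, 0],
        [0, 1]]
print(cross_correlate2d(image, filt))    # [[6, 8], [12, 14]]
```

In a trained network the entries of `filt` are not fixed by hand as here; they are the weights that gradient descent learns.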
Convolution (worked example)
Image (3x3): [1 2 3; 4 5 6; 7 8 9], 2x2 Filter: [1 0; 0 1]
Output (each entry is the sum Σ over one 2x2 window): [6 8; 12 14]
For example, the first entry is 1·1 + 2·0 + 4·0 + 5·1 = 6.
Why supervised learning using CNN is called Deep Learning
Convolution
In practice, several filters are used to learn different representations of the same data.
[Figure: an image convolved with four 2x2 filters, producing four outputs; the outputs are also drawn stacked as a single box (pictorial representation, both are equivalent).]
This box has dimension 4 x 2 x 2. Explain.
Convolution
Real images are represented by matrices stacked together.
Each color specifies a "channel". An RGB image has three channels. A BW (grayscale) image has one channel.
Assuming an RGB image, we need to perform convolution on all three channels.
[Figure: image in three RGB channels convolved with 2x2 filters with depth 3.] For simplicity, assume all RGB layers have the same values and all filters have the same values.
Output: one matrix, because we are convolving using one filter (a 2x2 filter with depth 3).
[Pictorial representation: input image with length, width and depth (= 3); filter; output.]
Convolution – stride
Stride: how many rows/columns the filter shifts by.
Some matrix 2x2 Filter (stride 1) 2x2 Filter (stride 2)
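To see the effect of the stride, the sketch below extends the sliding-window sum with a `stride` parameter. It is an illustration in plain Python with a made-up 4x4 matrix; `output_size` uses the standard formula $\lfloor (n - f)/s \rfloor + 1$ for an input of size $n$, a filter of size $f$ and stride $s$ without padding.

```python
def strided_cross_correlate2d(image, filt, stride=1):
    """Sliding-window sum where the filter shifts by `stride` rows/columns."""
    H, W = len(image), len(image[0])
    fh, fw = len(filt), len(filt[0])
    out = []
    for x in range(0, H - fh + 1, stride):
        row = []
        for y in range(0, W - fw + 1, stride):
            row.append(sum(image[x + m][y + n] * filt[m][n]
                           for m in range(fh) for n in range(fw)))
        out.append(row)
    return out

def output_size(n, f, stride):
    """Number of filter positions along one dimension (no padding)."""
    return (n - f) // stride + 1

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
filt = [[1, 0],
        [0, 1]]
print(strided_cross_correlate2d(matrix, filt, stride=1))  # 3 x 3 output
print(strided_cross_correlate2d(matrix, filt, stride=2))  # [[7, 11], [23, 27]]
```

A larger stride visits fewer window positions, so the output shrinks from 3x3 to 2x2 in this example.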
Nonlinear mapping
Convolution is a completely linear operation, want nonlinearity in network!
Add nonlinearity by sending each output of the convolution to a nonlinear function
If you want to tackle complex problems, you need to design a very deep neural network.
Training a deep neural net which contains hundreds of neurons connected by hundreds of thousands of connections is really challenging because of
1. The vanishing gradients problem, which affects DNNs and makes lower layers very hard to train.
2. The number of parameters, which makes training extremely slow.
3. The risk of overfitting.
In this tutorial we will present available techniques in Python to solve these problems.
Vanishing Gradient Problem
Gradients get smaller and smaller as backpropagation progresses down to the lower layers.
There are two suspects for this problem: initialization and the activation function.
Initialization
We want the variance of the outputs of each layer to be equal to the variance of its inputs.
This cannot be guaranteed unless the layer has an equal number of input and output connections.
TensorFlow implementation
Figure: Initialization parameters for each type of activation function
K-means, Review of Probability, Jointly Gaussian Random
Variables
Bolin Gao
Nov 7, 2019
K-means and Lloyd’s Algorithm
We now discuss a classic unsupervised learning problem called K-means. Suppose we are given a set of data, where we assume that
(i) data $x_n \in \mathbb{R}^d$
(ii) data belong to $K$ clusters (groups of data that share commonality with each other)
(iii) similar data are close in the Euclidean distance
Is it possible to partition the data into these $K$ different clusters? The problem is formally given below.
K means clustering problem (set theoretic)
Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $K$ clusters $\{B_k\}_{k=1}^{K} = \{B_1, B_2, \dots, B_K\}$, each $B_k \subseteq \mathbb{R}^d$, and vectors $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, such that,
$$L(B_1, \dots, B_K, \mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_n \in B_k} \|x_n - \mu_k\|_2^2 \tag{1}$$
is minimized.
The loss $L$ is sometimes referred to as the distortion measure. We note that the variables here are the vectors $\mu_1, \dots, \mu_K$, as well as the sets $B_1, \dots, B_K$.
The following algorithm describes a popular method for minimizing the loss $L$, usually referred to as the K-means algorithm (K-means is the problem; there are various algorithms to solve this problem).
Lloyd’s Algorithm
Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., those in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence
1. For all $n = 1, \dots, N$, assign $x_n$ to the nearest $\mu_k$, that is, compute
$$k = \arg\min_{i = 1, \dots, K} \|x_n - \mu_i\|_2 \in \{1, \dots, K\} \tag{2}$$
and assign $x_n$ to the set $B_k$.
2. For all $k = 1, \dots, K$, compute $\mu_k$ via,
$$\mu_k = \frac{1}{|B_k|} \sum_{x_n \in B_k} x_n, \tag{3}$$
where $|B_k|$ denotes the number of elements in the set $B_k$.
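The two steps of Lloyd's algorithm translate almost line by line into code. Below is a minimal sketch in plain Python on a made-up 2D data set. The function names, the stopping test (centroids unchanged) and the handling of an empty cluster (keep the old centroid) are our own assumptions; the algorithm as stated does not specify them.

```python
def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd(data, mus, max_iter=100):
    mus = [list(m) for m in mus]
    for _ in range(max_iter):
        # Step 1: assign each x_n to the set B_k of its nearest centroid.
        clusters = [[] for _ in mus]
        for x in data:
            k = min(range(len(mus)), key=lambda i: sq_dist(x, mus[i]))
            clusters[k].append(x)
        # Step 2: recompute each mu_k as the mean of B_k.
        new_mus = []
        for k, B in enumerate(clusters):
            if B:
                new_mus.append([sum(col) / len(B) for col in zip(*B)])
            else:
                new_mus.append(mus[k])   # assumption: keep centroid of an empty cluster
        if new_mus == mus:               # converged: centroids stopped moving
            break
        mus = new_mus
    loss = sum(sq_dist(x, mus[k]) for k, B in enumerate(clusters) for x in B)
    return mus, clusters, loss

# Hypothetical data: two well-separated groups of three points.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
mus, clusters, loss = lloyd(data, [data[0], data[3]])
print(mus, loss)
```

Since the result depends on the initial centroids, in practice one would call `lloyd` with several initializations and keep the run with the smallest loss.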
The vectors $\{\mu_k\}_{k=1}^{K}$ are referred to as the "mean vectors" or sometimes the centroids. The following is known about the K-means algorithm:
(i) The loss is monotonically non-increasing over the iterations.
(ii) There is no guarantee on the number of iterations needed for convergence.
(iii) There is no nontrivial lower bound on the gap between the value of the K-means loss at the algorithm's output and the minimum achievable value of the loss.
(iv) K-means might converge to a point which is not a local minimum.
The recommendation is to run K-means with different initializations and pick the best clustering.
Alternative description of K means problem (non-set theoretic)
While the previously mentioned algorithm is intuitive, it unfortunately requires us to associate vectors with the sets $\{B_k\}_{k=1}^{K}$, which are not explicitly computed. We now offer an equivalent description of the K-means problem, as well as Lloyd's algorithm for this problem, whereby the membership is explicitly specified. The following is taken from Chapter 9 of Bishop's book, which offers a non-set theoretic description of the K-means problem.
K-means clustering problem (non-set theoretic)
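Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, and responsibilities $\{r_{n,k}\}_{n=1,k=1}^{N,K}$, $r_{n,k} \in \{0, 1\}$, such that,
$$L(r_{1,1}, \dots, r_{N,K}, \mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{n,k} \|x_n - \mu_k\|_2^2 \tag{4}$$
is minimized and $\sum_{k=1}^{K} r_{n,k} = 1$, $\forall n$.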
Remark 1. Here $r_{n,k} = 1$ if $x_n$ is assigned to cluster $k$, and $r_{n,k} = 0$ otherwise. The constraint $\sum_{k=1}^{K} r_{n,k} = 1$ says that each data point $x_n$ is only allowed to be assigned to a single cluster. We can gather all the responsibilities of $x_n$ into a single vector,
$$r_n = (r_{n,1}, \dots, r_{n,K}) = \begin{bmatrix} r_{n,1} \\ \vdots \\ r_{n,K} \end{bmatrix} \tag{5}$$
Only one of the numbers $r_{n,1}, \dots, r_{n,K}$ is allowed to be 1, for example, $r_n = (0, 0, 1, 0, 0)$. Therefore, we say that $r_n$ is a "one-hot" encoding of the cluster membership of $x_n$. Observe that each $r_n$ is a vertex of the simplex $\Delta$.
Lloyd’s Algorithm (non set-theoretic)
Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., those in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence
Suppose that a data point $x_n$ we receive falls precisely on the boundary between two clusters. Then either the data point belongs to one of the clusters (which we resolve perhaps using a coin flip), or perhaps it is more appropriate to say that the data point belongs to both clusters. The second approach is the so-called soft K-means.
Soft K-means clustering problem (non-set theoretic)
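Given a data set $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, find $\{\mu_k\}_{k=1}^{K}$, $\mu_k \in \mathbb{R}^d$, and responsibilities $\{r_{n,k}\}_{n=1,k=1}^{N,K}$, $r_{n,k} \in [0, 1]$, such that,
$$L(r_{1,1}, \dots, r_{N,K}, \mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{n,k} \|x_n - \mu_k\|_2^2 \tag{8}$$
is minimized and $\sum_{k=1}^{K} r_{n,k} = 1$, $\forall n$.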
Remark 2. In this case, we may view $r_{n,k}$ as the "percentage" of a data point $x_n$ belonging to cluster $k$. We can gather all the responsibilities into a single vector,
$$r_n = (r_{n,1}, \dots, r_{n,K}) = \begin{bmatrix} r_{n,1} \\ \vdots \\ r_{n,K} \end{bmatrix} \tag{9}$$
Observe that each $r_n$ is an element of the simplex $\Delta$.
Lloyd’s Algorithm for Soft K-means
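Data: $\mathcal{D} = \{x_n\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
Input: $\{\mu_k\}_{k=1}^{K}$ set to some random values, e.g., those in $\mathcal{D}$.
Output: $\{\mu_k\}_{k=1}^{K}$ corresponding to the optimal loss value
Repeat until convergence
1. For all $n = 1, \dots, N$, $k = 1, \dots, K$,
$$r_{n,k} = \frac{\exp(-\lambda \|x_n - \mu_k\|_2^2)}{\sum_{l=1}^{K} \exp(-\lambda \|x_n - \mu_l\|_2^2)} = \sigma_k\left( -\lambda \begin{bmatrix} \|x_n - \mu_1\|_2^2 \\ \vdots \\ \|x_n - \mu_K\|_2^2 \end{bmatrix} \right), \tag{10}$$
where $\lambda > 0$ and $\sigma_k$ is the $k$th component of the softmax function.
2. For all $k = 1, \dots, K$, compute $\mu_k$ via,
$$\mu_k = \frac{\sum_{n=1}^{N} r_{n,k}\, x_n}{\sum_{n=1}^{N} r_{n,k}} \tag{11}$$
Algorithm 3: Lloyd's Algorithm for Soft K-means ("Soft K-means algorithm")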
Note that there was a slight typo from last time. The Euclidean distance term inside of the exponential should be raised to the power of 2.
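Remark 3. Note the following,
• For $x_n$ close to $\mu_k$ (relative to the other centroids), $\|x_n - \mu_k\|_2$ is small, hence $r_{n,k} \to 1$
• For $x_n$ far from $\mu_k$ (relative to the other centroids), $\|x_n - \mu_k\|_2$ is big, hence $r_{n,k} \to 0$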
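One iteration of soft K-means can be sketched as follows, in plain Python. The responsibilities are the softmax of the scaled negative squared distances and the centroids are responsibility-weighted means. The function names, the data and the value of $\lambda$ are illustrative assumptions; subtracting the smallest distance before exponentiating is a stability trick that leaves the ratio unchanged.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def responsibilities(x, mus, lam):
    """r_{n,k} = exp(-lam d_k) / sum_l exp(-lam d_l), with d_k = ||x - mu_k||^2."""
    d = [sq_dist(x, mu) for mu in mus]
    m = min(d)                                   # shift for numerical stability
    e = [math.exp(-lam * (di - m)) for di in d]
    s = sum(e)
    return [ei / s for ei in e]

def soft_kmeans_step(data, mus, lam):
    """One pass of steps 1 and 2: update responsibilities, then centroids."""
    R = [responsibilities(x, mus, lam) for x in data]
    dim = len(data[0])
    new_mus = []
    for k in range(len(mus)):
        total = sum(r[k] for r in R)
        new_mus.append([sum(r[k] * x[j] for r, x in zip(R, data)) / total
                        for j in range(dim)])
    return R, new_mus

# Hypothetical data: two well-separated groups of three points.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
R, new_mus = soft_kmeans_step(data, [[0.0, 0.0], [10.0, 10.0]], lam=1.0)
print(R[0], new_mus)
```

A large $\lambda$ pushes each $r_n$ toward a one-hot vector, so the update approaches the hard assignment of Lloyd's algorithm; a small $\lambda$ spreads each point over all clusters.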
Review of Probability and Jointly Gaussian Random Variables
Random Variable
Our (abridged) story of probability starts with the notion of a random variable.
Definition 1. A random variable is a function $X : \Omega \to \mathbb{R}$, $\omega \mapsto X(\omega)$.
$\Omega$ is referred to as the sample space, which consists of the collection of outcomes of some underlying random experiment. This set is very general; it can contain the names of all the students in the classroom. Random variables provide us a way to talk about these very general objects using mathematics. Common examples include $\Omega = \{\text{Head}, \text{Tail}\}$ or $\Omega = [0, \infty)$. The symbol $\omega$ denotes a single outcome in the sample space $\Omega$. We refer to any subset $E \subseteq \Omega$ as an event. It is important to note that $\omega$ is an element of $\Omega$, but not a subset of it. This being said, the set containing $\omega$, i.e., $\{\omega\}$, is a subset of $\Omega$. It is referred to as an elementary event.
We say that the random variable $X$ is discrete if it maps to a discrete set $\{x_1, \dots, x_N\}$ (finite) or $\{x_1, \dots, x_N, \dots\}$ (countable), and continuous if it maps to some interval $C \subseteq \mathbb{R}$ (uncountable).
Probability Measure
To talk about the probability of certain outcomes or events, we need the notion of a probability measure.
Definition 2. A probability measure is a function $\Pr$ that maps events $E \subseteq \Omega$ to $[0, 1]$, $E \mapsto \Pr[E]$, and satisfies the following properties
(i) $\Pr[E] \in [0, 1]$
(ii) $\Pr[\Omega] = 1$, $\Pr[\emptyset] = 0$
(iii) If $E_i \cap E_j = \emptyset$, $\forall i \ne j$, then $\Pr\big[\bigcup_{n=1}^{\infty} E_n\big] = \sum_{n=1}^{\infty} \Pr[E_n]$.
The final property is called countable additivity. A curious thing about the notion of a probability measure is that not every subset $E \subset [0, 1]$ (or of any set containing a continuum of numbers) has a defined $\Pr[E]$; therefore, not all subsets $E \subset [0, 1]$ can be events! Examples of such sets are the Vitali sets, whose construction is intimately linked to the Axiom of Choice, which is at the heart of the foundations of mathematics. For details, see page 74 of the text by Alberto Leon-Garcia.
Some properties associated with the probability measure that can be derived,
Remark 4. There are two shorthands used in probability that cause confusion for beginners,
(1) $X$ is a function of $\omega$, but in practice, $\omega$ is omitted, and one writes $X = X(\omega)$, where $X$ is now a real number. This is of course bad practice, as we now confuse the function with the image of the function, but it has stuck; virtually all applied probability textbooks use this notation. For example, when one writes $X = 2Y$, we actually mean that for all $\omega \in \Omega$, $X(\omega) = 2Y(\omega)$.
(2) Oftentimes, when one wishes to talk about the probability of the random variable falling at, below, or within an interval of some numbers, one writes,
$$\Pr[X = x] \qquad \Pr[X \le x] \qquad \Pr[x_1 \le X \le x_2] \tag{12}$$
But this is strange, as $\Pr$ is a function of events (not statements such as "$X \le x$"). What we actually mean when we write these is, for example, $\Pr[X = x] = \Pr[\{\omega \in \Omega \mid X(\omega) = x\}]$.
Probability Mass Function, Cumulative Distribution Function, Probability Density Function
The three famous functions: probability mass function, cumulative distribution function and probability density function are simply different ways of talking about the probability measure of certain events.
Probability Mass Function
Definition 3. The probability mass function (PMF) for a discrete random variable $X$ is the function $P_X(x) = \Pr[X = x]$ and for a pair of random variables $X, Y$ is the function $P_{X,Y}(x, y) = \Pr[X = x, Y = y]$.
The PMF is only defined for discrete random variables because for a continuous random variable, the probability at any point $x$, $\Pr[X = x]$, is 0.
Example 1. (Poisson RV)
$$P_X(k) = \frac{\alpha^k}{k!} e^{-\alpha},\ k = 0, 1, \dots, \qquad E[X] = \alpha,\ \mathrm{VAR}[X] = \alpha \tag{13}$$
Cumulative Distribution Function
Definition 4. The cumulative distribution function (CDF) for a (discrete or continuous) random variable $X$ is the function $F_X(x) = \Pr[X \le x]$ and for a pair of random variables $X, Y$ is the function $F_{X,Y}(x, y) = \Pr[X \le x, Y \le y]$.
Example 2. (Exponential RV)
$$F_X(x) = \begin{cases} 1 - \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases}, \qquad E[X] = \frac{1}{\lambda},\ E[X^k] = \frac{k!}{\lambda^k},\ \mathrm{VAR}[X] = \frac{1}{\lambda^2} \tag{14}$$
The CDF is non-decreasing and right-continuous. The limit of $F_X(x)$ as $x \to \infty$ is 1, and as $x \to -\infty$ is 0.
Here are some properties involving the CDF that might be good to be reminded of (you do not need to know how they are derived).
$$\Pr[X = x_1] = F_X(x_1) - F_X(x_1^-)$$
$$\Pr[x_1 \le X \le x_2] = F_X(x_2) - F_X(x_1) + \Pr[X = x_1]$$
And for a pair of random variables,
$$\Pr[X > x, Y > y] = 1 - \Pr[\{X \le x\} \cup \{Y \le y\}] = 1 - \big( \Pr[X \le x] + \Pr[Y \le y] - \Pr[\{X \le x\} \cap \{Y \le y\}] \big) = 1 - F_X(x) - F_Y(y) + F_{XY}(x, y)$$
$$\Pr[x_1 \le X \le x_2, y_1 \le Y \le y_2] = F_{XY}(x_2, y_2) - F_{XY}(x_1, y_2) - F_{XY}(x_2, y_1) + F_{XY}(x_1, y_1) + \Pr[X = x_1, y_1 \le Y \le y_2] + \Pr[x_1 < X \le x_2, Y = y_1]$$
$$\Pr[x_1 < X \le x_2, Y \le y_2] = \Pr[X \le x_2, Y \le y_2] - \Pr[X \le x_1, Y \le y_2] = F_{XY}(x_2, y_2) - F_{XY}(x_1, y_2)$$
Probability Density Function
Definition 5. The probability density function (PDF) for a continuous random variable $X$ is the function $f_X(x) = \dfrac{dF_X(x)}{dx}$ and for a pair of random variables $X, Y$ is the function $f_{X,Y}(x, y) = \dfrac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}$.
Remark 5. We note that, if a random variable has a distribution $f_X(x)$, we write $X \sim f_X(x)$. Confusingly, in machine learning, the random variable is often written as a lower case character, $x$, and for the PDF, the subscript is dropped, resulting in $x \sim f(x)$, or $x \sim p(x)$. So it is very important to have a clear understanding of what the random variable is.
Some properties of the PDF:
$$f_X(x) \ge 0$$
$$\Pr[x_1 \le X \le x_2] = \int_{x_1}^{x_2} f_X(x)\, dx$$
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$$
$$1 = \int_{-\infty}^{\infty} f_X(x)\, dx$$
Example 3. (Exponential RV)
$$f_X(x) = \begin{cases} \lambda \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases} \tag{15}$$
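The relations between the PDF and the CDF can be checked numerically on the exponential random variable. The sketch below, in plain Python, integrates the PDF with a simple trapezoid rule and compares against the closed-form CDF and the mean $1/\lambda$. The helper names, the rate $\lambda = 2$ and the integration grids are our own choices, so the agreement is only up to the discretization error.

```python
import math

def pdf(x, lam):
    """Exponential PDF: lam * exp(-lam x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x, lam):
    """Exponential CDF: 1 - exp(-lam x) for x >= 0, else 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def trapezoid(f, a, b, n=20000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

lam = 2.0
x1, x2 = 0.5, 1.5

# Pr[x1 <= X <= x2] as an integral of the PDF and as a difference of CDF values.
prob_integral = trapezoid(lambda x: pdf(x, lam), x1, x2)
prob_cdf = cdf(x2, lam) - cdf(x1, lam)

# E[X] = integral of x f(x); the tail beyond 20 is negligible for lam = 2.
mean = trapezoid(lambda x: x * pdf(x, lam), 0.0, 20.0)

print(prob_integral, prob_cdf, mean)
```

For a continuous random variable the term $\Pr[X = x_1]$ is zero, which is why the plain difference of CDF values is enough here.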
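Example 4. (Gaussian RV)
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right), \qquad E[X] = m,\ \mathrm{VAR}[X] = \sigma^2 \tag{16}$$
The Gaussian random variable is often written as $X \sim \mathcal{N}(m, \sigma^2)$.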
Theorem 1. If $X \sim \mathcal{N}(m, \sigma^2)$ and $Y = aX + b$ with $a \ne 0$, then $Y \sim \mathcal{N}(am + b, (|a|\sigma)^2)$.
Theorem 2. If $X, Y$ are independent and Gaussian, their sum $Z = X + Y$ is also Gaussian.
Theorem 3. (Page 217, Papoulis, 4th Ed) (Cramer) If $X, Y$ are independent and $Z = X + Y$ is Gaussian, then $X$ and $Y$ are both Gaussian.
Jointly Gaussian Random Variables
We close our review of probability with a description of the jointly Gaussian random variable.
Definition 6. We say that a pair of random variable X,Y is jointly Gaussian if
Theorem 4. Let $X, Y$ be jointly Gaussian, then $X, Y$ are independent ($f_{XY}(x, y) = f_X(x) f_Y(y)$) if and only if $\mathrm{COV}(X, Y) = 0$ ($X, Y$ are uncorrelated).
Remark 6. This is not true if $X, Y$ are not jointly Gaussian. For general random variables, $X, Y$ independent implies uncorrelated. But uncorrelated does not imply independent.
Theorem 5. If X,Y jointly Gaussian, their marginal PDFs are Gaussian.
Theorem 6. If X,Y jointly Gaussian, the conditional PDF of X given Y = y is Gaussian.
Theorem 7. A linear transform of jointly Gaussian random variables is jointly Gaussian. That is, suppose $X, Y$ are jointly Gaussian random variables, and define random variables $Z, W$ by
$$\begin{bmatrix} Z \\ W \end{bmatrix} = A \begin{bmatrix} X \\ Y \end{bmatrix}, \quad A \in \mathbb{R}^{2 \times 2} \tag{18}$$
then $Z, W$ are jointly Gaussian.
Finally, we note that if $X, Y$ are each Gaussian, they are not necessarily jointly Gaussian. Despite this being a very important result, counterexamples usually take a bit of effort to construct.