Alex ACM SC Machine Learning Day [Materials] | Introduction to Machine Learning By Eng. Ibrahim Sabek

Apr 14, 2018

Transcript

    Introduction to Machine Learning

    Ibrahim Sabek

Computer and Systems Engineering Department, Faculty of Engineering,

    Alexandria University, Egypt


    Agenda

1 Machine learning overview and applications
2 Supervised vs. Unsupervised learning

    3 Generative vs. Discriminative models

    4 Overview of Classification

    5 The big picture

    6 Bayesian inference

    7 Summary

    8 Feedback



Machine learning overview and applications

What is Machine Learning (ML)?
Definition: algorithms for inferring unknowns from knowns.

What do we mean by inferring? How do we get unknowns from knowns?

ML applications

Spam detection
Handwriting recognition
Speech recognition
Netflix recommendation system

Classes of ML models

Supervised vs. Unsupervised
Generative vs. Discriminative


Supervised vs. Unsupervised learning

Supervised: Given (x1, y1), (x2, y2), ..., (xn, yn), choose a function f such that f(xi) = yi
xi ∈ R^2: data points
yi: class/value
Classification: yi ∈ {finite set}
Regression: yi ∈ R
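To make the supervised setup concrete, here is a minimal Python sketch with hypothetical hand-made data: a finite-label f for classification and a real-valued f for regression. The nearest-neighbor rules are illustrative choices of f, not anything the slides prescribe.

```python
# Classification: yi comes from a finite set, here {0, 1}.
train_c = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]

def f_classify(x):
    # Predict the label of the closest training point.
    return min(train_c, key=lambda p: abs(p[0] - x))[1]

# Regression: yi is real-valued; average the two nearest labels.
train_r = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

def f_regress(x):
    nearest = sorted(train_r, key=lambda p: abs(p[0] - x))[:2]
    return sum(y for _, y in nearest) / 2

print(f_classify(8.5))  # → 1, a label from the finite set
print(f_regress(2.5))   # → 2.5, a real value
```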


Supervised vs. Unsupervised learning

Unsupervised: Given (x1, x2, ..., xn), find patterns in the data.
xi ∈ R^2: data points
Clustering
Density estimation
Dimensionality reduction
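The clustering bullet can be illustrated with a tiny k-means sketch. The 1-D data and starting centers are hypothetical; the slides do not prescribe any particular clustering algorithm.

```python
def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]       # two obvious groups
print(sorted(kmeans_1d(data, [0.0, 5.0])))  # centers near 1.0 and 9.0
```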


Variations on Supervised and Unsupervised

Semi-supervised: Given (x1, y1), (x2, y2), ..., (xk, yk), xk+1, xk+2, ..., xn, predict yk+1, yk+2, ..., yn

Active learning: the learner chooses which unlabeled points to query for labels


Variations on Supervised and Unsupervised

Decision theory: measure the prediction performance on unlabeled data
Reinforcement learning:
maximize rewards (minimize losses) through actions
maximize the overall lifetime reward
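As a sketch of the reward-maximization idea, here is a hypothetical two-armed bandit with an epsilon-greedy agent, one of the simplest reinforcement-learning strategies. The arm payouts and exploration rate are made up for illustration.

```python
import random

random.seed(0)
true_means = [0.3, 0.8]   # arm 1 pays off more often (unknown to the agent)
counts = [0, 0]           # pulls per arm
values = [0.0, 0.0]       # running estimate of each arm's reward

for t in range(1000):
    # Explore with probability 0.1, otherwise exploit the best-looking arm.
    arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts)  # the better arm ends up pulled far more often
```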


Generative vs. Discriminative models

Given (x1, y1), (x2, y2), ..., (xn, yn), and a new point (x, y):

Discriminative: estimate p(y = 1|x) and p(y = 0|x) directly, for y ∈ {0, 1}

Generative: estimate the joint distribution p(x, y)
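A small numeric sketch of the generative route, using made-up count data: estimate the joint p(x, y) from frequencies, then recover p(y|x) by dividing by the marginal p(x).

```python
from collections import Counter

# Hypothetical observed (x, y) pairs over binary x and y.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
joint = Counter(data)
n = len(data)

def p_joint(x, y):
    # Empirical estimate of the joint distribution p(x, y).
    return joint[(x, y)] / n

def p_y_given_x(y, x):
    # p(y|x) = p(x, y) / p(x), with p(x) obtained by summing out y.
    px = sum(p_joint(x, yy) for yy in (0, 1))
    return p_joint(x, y) / px

print(p_y_given_x(1, x=0))  # p(y=1 | x=0)
```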

Overview of Classification

k-Nearest Neighbor classification (kNN)

Given D = {(x1, y1), (x2, y2), ..., (xn, yn)} and a new point (x, y), where xi ∈ R, yi ∈ {0, 1}
Dissimilarity metric: d(x, x') = ||x − x'||^2 (for k = 1)
Probabilistic interpretation:
Given fixed k, p(y) = fraction of points xi in Nk(x) s.t. yi = y
ŷ = argmax_y p(y|x, D)
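The slide translates almost line-for-line into code: the squared-distance metric, the k nearest neighbors Nk(x), and the argmax over label fractions. The 1-D dataset is hypothetical.

```python
from collections import Counter

D = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]

def knn_predict(x, k=3):
    # Sort by the dissimilarity metric d(x, xi) = ||x - xi||^2.
    neighbors = sorted(D, key=lambda p: (p[0] - x) ** 2)[:k]
    votes = Counter(y for _, y in neighbors)   # p(y) = fraction of each label
    return votes.most_common(1)[0][0]          # argmax_y p(y|x, D)

print(knn_predict(1.8))  # → 0
print(knn_predict(8.2))  # → 1
```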


Classification trees (CART)

Given D = {(x1, y1), (x2, y2), ..., (xn, yn)} and a new x, where xi ∈ R, yi ∈ {0, 1}
Build a binary tree
Minimize the error in each leaf
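A full CART implementation recurses on the best split; as a sketch of the "minimize the error in each leaf" idea, here is a single-split tree (a decision stump) over hypothetical 1-D data: try every threshold and keep the split with the fewest misclassified points in its two leaves.

```python
D = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 0)]

def majority(labels):
    return max(set(labels), key=labels.count)

def fit_stump(data):
    best = None
    for t in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        if not left or not right:
            continue
        # Leaf error: points whose label differs from their leaf's majority.
        err = sum(y != majority(left) for y in left) + \
              sum(y != majority(right) for y in right)
        if best is None or err < best[0]:
            best = (err, t, majority(left), majority(right))
    return best  # (error, threshold, left-leaf label, right-leaf label)

err, t, yl, yr = fit_stump(D)
predict = lambda x: yl if x <= t else yr
print(t, predict(2.5), predict(7.5))  # best split at 3.0; leaves predict 0 and 1
```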


Regression trees (CART)

Given D = {(x1, y1), (x2, y2), ..., (xn, yn)} and a new x, where xi ∈ R, yi ∈ R


Bootstrap aggregation (Bagging)

Given D = {(x1, y1), (x2, y2), ..., (xn, yn)} drawn iid from P, and a new x where xi ∈ R, yi ∈ R, we need to find its y value

Intuition: averaging makes your prediction close to the true label
Build different training datasets, with each (xik, yik) drawn iid from Uniform(D).
The final label y is the average of the labels generated from the different datasets.
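The two bullets map directly onto code. In this sketch the base learner is deliberately weak (it predicts the mean label of its bootstrap sample and ignores x entirely; a real learner would fit a tree here), and D is hypothetical; the point is only the resample-then-average mechanics.

```python
import random

random.seed(1)
D = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 4.2)]

def bagged_predict(x, B=200):
    preds = []
    for _ in range(B):
        # A bootstrap dataset: n iid draws from Uniform(D), with replacement.
        boot = [random.choice(D) for _ in D]
        # Weak base learner: the mean label of the bootstrap sample.
        preds.append(sum(y for _, y in boot) / len(boot))
    # Final label: average the predictions from the different datasets.
    return sum(preds) / B

print(bagged_predict(2.5))  # close to the overall mean label 2.6
```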


Random forests

Given D = {(x1, y1), (x2, y2), ..., (xn, yn)} where xi ∈ R, yi ∈ R
For i = 1, ..., B:
Choose a bootstrap sample Di from D
Construct a tree Ti using Di s.t. at each node you choose a random subset of the features and only consider splitting on these features.
Given x, take the majority vote (for classification) or the average (for regression).
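The steps above can be sketched as a classification forest. For brevity each tree is shrunk to a one-split stump, the "random subset of features" is a single randomly chosen feature of two, and the data are hypothetical; the bootstrap-plus-feature-subsampling-plus-vote structure is what the slide describes.

```python
import random

random.seed(2)
# Hypothetical 2-feature points: ((f0, f1), label); both features informative.
D = [((1.0, 1.0), 0), ((2.0, 2.0), 0), ((8.0, 8.0), 1), ((9.0, 9.0), 1)]

def fit_random_stump(data, feat):
    # Best single split on the allowed feature, by leaf misclassification count.
    labels = [y for _, y in data]
    maj = max(set(labels), key=labels.count)
    best = (len(data) + 1, float("inf"), maj, maj)  # fallback: constant leaf
    for t in sorted({x[feat] for x, _ in data}):
        left = [y for x, y in data if x[feat] <= t]
        right = [y for x, y in data if x[feat] > t]
        if not left or not right:
            continue
        yl = max(set(left), key=left.count)
        yr = max(set(right), key=right.count)
        err = sum(y != yl for y in left) + sum(y != yr for y in right)
        if err < best[0]:
            best = (err, t, yl, yr)
    return (feat,) + best[1:]

def fit_forest(B=50):
    forest = []
    for _ in range(B):
        Di = [random.choice(D) for _ in D]  # bootstrap sample Di from D
        feat = random.randrange(2)          # random feature subset (size 1 here)
        forest.append(fit_random_stump(Di, feat))
    return forest

def forest_predict(forest, x):
    votes = [yl if x[f] <= t else yr for f, t, yl, yr in forest]
    return max(set(votes), key=votes.count)  # majority vote

forest = fit_forest()
print(forest_predict(forest, (8.5, 8.5)))
```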


The big picture

Given the expected loss function E[L(y, f(x))] and D = {(x1, y1), (x2, y2), ..., (xn, yn)} where xi ∈ R, yi ∈ R, we want to estimate p(y|x)

Discriminative: estimate p(y|x) directly using D.
kNN, Trees, SVM
Generative: estimate p(x, y) directly using D, and then p(y|x) = p(x, y) / p(x); we also have p(x, y) = p(x|y)p(y)
Params/Latent variables θ: by including parameters, we have p(x, y|θ)
For a discrete space: p(y|x, D) = Σ_θ p(y|x, D, θ) p(θ|x, D)
p(y|x, D, θ) is nice
p(θ|x, D) is nasty (called the posterior distribution on θ)
The summation (or integration, for a continuous space) is nasty and often intractable


The big picture

p(y|x, D) = Σ_θ p(y|x, D, θ) p(θ|x, D)

Exact inference:
Multivariate Gaussian
Graphical models
Point estimate of θ:
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori (MAP): θ̂ = argmax_θ p(θ|x, D)
Deterministic approximation:
Laplace approximation
Variational methods
Stochastic approximation:
Importance sampling
Gibbs sampling
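To ground the point-estimate bullet: MLE for a coin's bias θ from hypothetical flips. For Bernoulli data the argmax has the closed form heads/n; the grid search below just confirms that numerically. MAP would additionally weight each θ by a prior p(θ).

```python
flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 heads out of 10 (made-up data)

def likelihood(theta):
    # p(D | theta) for iid Bernoulli flips.
    p = 1.0
    for f in flips:
        p *= theta if f == 1 else (1 - theta)
    return p

# theta^ = argmax_theta p(D | theta), searched on a fine grid.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=likelihood)
print(theta_mle)  # → 0.7, matching heads / n
```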


Bayesian inference

Put distributions on everything, then use the rules of probability to infer values

Aspects of Bayesian inference:
Priors: assume a prior distribution p(θ)
Procedures: minimize the expected loss (averaging over θ)

Pros:
Directly answers questions
Avoids overfitting

Cons:
Must assume a prior
Exact computation can be intractable


Directed graphical models

Bayesian networks, or conditional independence diagrams:
Why? Tractable inference.
Factorization of the probabilistic model
Notational device
Visualization for inference algorithms
Example of thinking graphically about p(a, b, c):
p(a, b, c) = p(c|a, b)p(a, b) = p(c|a, b)p(b|a)p(a)
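The factorization above can be checked numerically for a hypothetical joint over three binary variables, with each conditional computed from the table by division.

```python
import itertools

# A made-up joint p(a, b, c); the eight entries sum to 1.
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
         (1, 0, 0): 0.20, (1, 0, 1): 0.05, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marg(**fixed):
    # Marginal probability of the fixed variables, summing out the rest.
    return sum(p for (a, b, c), p in joint.items()
               if all({'a': a, 'b': b, 'c': c}[k] == v for k, v in fixed.items()))

for a, b, c in itertools.product((0, 1), repeat=3):
    lhs = joint[(a, b, c)]
    # p(c|a,b) * p(b|a) * p(a), each conditional built by division.
    rhs = (joint[(a, b, c)] / marg(a=a, b=b)) \
        * (marg(a=a, b=b) / marg(a=a)) \
        * marg(a=a)
    assert abs(lhs - rhs) < 1e-9

print("factorization holds for every (a, b, c)")
```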


    Summary

Machine learning is an essential field for our lives.

Machine learning is a broad world; we have only just started exploring it in this session :D :D


    Feedback

Your feedback is welcome at alex.acm.org/feedback/machine/