Copyright © 2012, SAS Institute Inc. All rights reserved.
MACHINE LEARNING WITH SAS WORKSHOP
GETTING THE MOST OUT OF YOUR DATA
Longhow Lam
AGENDA AND SOME READING MATERIAL
Intro & positioning of Machine learning
SAS platform for Machine learning
Overview of Specific methods
Some examples
Further reading
An experimental comparison of classification techniques for imbalanced
credit scoring data sets using SAS® Enterprise Miner
http://support.sas.com/resources/papers/proceedings12/129-2012.pdf
Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update
http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf
Highly recommended for more detail:
The Elements of Statistical Learning, Hastie, Tibshirani & Friedman
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
LONGHOW LAM SHORT BIO
MSc Mathematics (1995), Vrije Universiteit Amsterdam (Dutch "drs." degree in mathematics)
MTD Applied Statistics (1997), Technical University Delft (two-year postgraduate traineeship in applied statistics)
10+ years of SAS experience (Base / Stat / Guide / Miner / VA / VS)
10+ years of R experience (An introduction to R)
10+ years of predictive modeling experience
ABNAMRO – Risk modeler
Basel, Credit risk, ALM models
Business&Decision – Quantitative consultant
ING Belgium, Fortis
Leaseplan, Belgium Post
Experian – data miner
Collection Score, Delphi credit score, consulting
Follow me: @longhowlam
INTRO MACHINE LEARNING
Wikipedia:
“Machine learning is a scientific discipline that deals with the construction
and study of algorithms that can learn from data. Such algorithms operate by
building a model based on inputs and using that to make predictions or
decisions, rather than following only explicitly programmed instructions.”
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
Statistical modeling
Supervised learning
Unsupervised learning
Clustering
Data mining
Machine learning
Dimension reduction
Association rules
Recommender
Autoencoders
Self-organizing maps
SAS SOFTWARE
FOR MACHINE LEARNING (AND DATA MINING)
THE ANALYTICS LIFECYCLE
Steps: Identify / formulate problem, Data preparation, Data exploration, Transform & select, Build model, Validate model, Deploy model, Evaluate / monitor results
Roles: Business manager, Business analyst, Data miner / data scientist, IT systems / management
SAS tools: SAS In-Database Scoring, SAS Decision Manager, SAS Model Manager, SAS Enterprise Guide, Enterprise Miner / Text Miner, SAS IMSTAT / Recommender, SAS Visual Analytics, SAS Visual Statistics
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata
    structure = markovblanket;
    model default = x1 LTV income age;
    selection = Y;
RUN;
HIGH PERFORMANCE MACHINE LEARNING
Machine learning algorithms designed to run on single-blade or multi-blade distributed-memory environments
MACHINE LEARNING WITH SAS: EASILY DEPLOYABLE
Manage rules + data + models
Deployment flexibility:
Batch
Real time
Stored process
In database
Drive reuse and consistency
PREDICT SOMEONE’S INCOME
Predict someone's income from his/her age
Collect some data (analytical base table)
Plot the data (Income versus Age)
Fit a line: Income = 15.2 + 1.102 × Age
IS THIS MACHINE LEARNING?
MACHINE LEARNING ADDRESSING SOME MODELING ISSUES
The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, ...?
You do not have just one input variable: X1, X2, X3, ..., X567
Interactions and correlations between input variables
[Figure: income versus age for male and female; analytical base table with derived inputs]
MACHINE LEARNING WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100.000 ‘letters’ to randomly
sampled prospects:
Conversion rate is around 1%. Profit per conversion is €80, cost per mailing is €0.70
Total profit = 100.000 × 1% × €80 − 100.000 × €0.70 = €10.000
Now suppose we target the mailing with a machine learning predictive model that uses prospect input data to distinguish high responders from low responders.
MACHINE LEARNING WHY IT CAN MATTER € € €
Decile N Conversion Profit (€) Cumulative profit (€)
1 10.000 2.00% 9.000 9.000
2 10.000 1.50% 5.000 14.000
3 10.000 1.00% 1.000 15.000
4 10.000 1.00% 1.000 16.000
5 10.000 1.00% 1.000 17.000
6 10.000 1.00% 1.000 18.000
7 10.000 1.00% 1.000 19.000
8 10.000 0.80% -600 18.400
9 10.000 0.50% -3.000 15.400
10 10.000 0.20% -5.400 10.000
The profit from using a model to send
letters only to the first 7 deciles is now:
€ 19.000 (instead of € 10.000)
If you have 100 such campaigns a
year, that means an increase of
€ 0.9 mln !!
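The decile arithmetic above can be reproduced with a few lines of code. This is a minimal sketch, assuming the figures on the slide (10.000 letters per decile, €80 profit per conversion, €0.70 per letter); the function and variable names are illustrative only.

```python
# Figures taken from the slide; names are illustrative, not part of any SAS product.
LETTERS_PER_DECILE = 10_000
PROFIT_PER_CONVERSION = 80.0   # euro
COST_PER_LETTER = 0.70         # euro

# Conversion rate per decile, best-scored decile first (first model table).
conversion_rates = [0.02, 0.015, 0.01, 0.01, 0.01, 0.01, 0.01, 0.008, 0.005, 0.002]


def decile_profits(rates):
    """Profit per decile: conversions * profit per conversion - mailing cost."""
    return [
        LETTERS_PER_DECILE * rate * PROFIT_PER_CONVERSION
        - LETTERS_PER_DECILE * COST_PER_LETTER
        for rate in rates
    ]


def cumulative(values):
    """Running total of a list of numbers."""
    total, out = 0.0, []
    for value in values:
        total += value
        out.append(total)
    return out


profits = decile_profits(conversion_rates)
cumulative_profits = cumulative(profits)

# Mail only up to the decile where the cumulative profit peaks.
best_profit = max(cumulative_profits)
best_cutoff = cumulative_profits.index(best_profit) + 1

print(f"Mail the first {best_cutoff} deciles, profit about {best_profit:.0f} euro")
```

Replacing `conversion_rates` with the rates of the other model tables gives their cut-off and profit in the same way.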
MACHINE LEARNING WHY IT CAN MATTER € € €
Decile N Conversion Profit (€) Cumulative profit (€)
1 10.000 3.00% 17.000 17.000
2 10.000 2.00% 9.000 26.000
3 10.000 1.40% 4.200 30.200
4 10.000 1.15% 2.200 32.400
5 10.000 1.00% 1.000 33.400
6 10.000 0.60% -2.200 31.200
7 10.000 0.40% -3.800 27.400
8 10.000 0.30% -4.600 22.800
9 10.000 0.10% -6.200 16.600
10 10.000 0.05% -6.600 10.000
The profit from using a much better model
to send letters only to the first 5 deciles
is now:
€ 33.400 (instead of € 10.000)
If you have 100 such campaigns a
year, that means an increase of
€ 2.34 mln !!
MACHINE LEARNING WHY IT CAN MATTER € € €
Decile N Conversion Profit (€) Cumulative profit (€)
1 10.000 3.35% 19.800 19.800
2 10.000 2.23% 10.840 30.640
3 10.000 1.30% 3.400 34.040
4 10.000 1.10% 1.800 35.840
5 10.000 1.00% 1.000 36.840
6 10.000 0.55% -2.600 34.240
7 10.000 0.28% -4.760 29.480
8 10.000 0.25% -5.000 24.480
9 10.000 0.05% -6.600 17.880
10 10.000 0.02% -6.840 11.040
Now let's suppose we have a model that is
slightly better still than the last one.
Sending letters only to the first 5 deciles now yields:
€ 36.840
If you have 100 such campaigns a
year, that means an increase of
€ 2.68 mln !!
OVERVIEW OF SPECIFIC
MACHINE LEARNING METHODS
Classical regression
Decision trees
Dimension reduction
Bagging & Boosting
Support vector machines
K-Nearest Neighbour
Neural networks / deep learning
Bayesian networks
Text mining
Recommendation engine
“CLASSICAL” REGRESSION
LINEAR & LOGISTIC REGRESSION
Numeric target variable (Income versus Age):
Income = a + b × Age
Binary target variable (P(Churn), between 0 and 1, versus Age):
P(Churn) = 1 / (1 + EXP(a + b × Age))
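For the numeric target, the coefficients a and b of the straight line can be estimated by ordinary least squares. The sketch below uses the closed-form solution for one input; the data are hypothetical and generated exactly from the line Income = 15.2 + 1.102 × Age used earlier in the slides, so the fit should recover those two numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on Income = 15.2 + 1.102 * Age.
ages = [20, 30, 40, 50, 60]
incomes = [15.2 + 1.102 * age for age in ages]

a, b = fit_line(ages, incomes)
print(f"Income = {a:.1f} + {b:.3f} x Age")
```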
SPLINE REGRESSION MODELING NON LINEARITIES
Often there is a nonlinear relation
• Transformation of inputs: X², X³, log(X), etc.
• Buckets / binning of variables
Smoothing splines
[Figure: Y or logit(Y) versus X]
SPLINE REGRESSION MODELING NON LINEARITIES
Smoothing Splines: Piecewise polynomials that are glued together at knots
Two special cases for the smoothing parameter λ:
λ = 0: any function that interpolates the data
λ = ∞: the simple least-squares line fit
Choose λ by cross-validation
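The criterion in which λ appears is not part of the extracted slide text. As a sketch, assuming the standard smoothing spline formulation (as in The Elements of Statistical Learning), the fitted function f minimizes a penalized residual sum of squares:

```latex
\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \;+\; \lambda \int f''(t)^2 \, dt
```

With λ = 0 the curvature penalty vanishes, so any interpolating function attains the minimum; as λ → ∞ no curvature is tolerated and f reduces to the least-squares straight line, which matches the two special cases above.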
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
Data were extracted from a car sales site. For many cars we have the
kilometres driven and the car price. For the Opel Astra we have 2360 cars.
What is the relation between kilometres driven and car sales price?
[Figures: too much smoothing and too little smoothing]
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
The optimal smoothing parameter is 0.2
Some other car makes/models with
spline estimates of car depreciation
versus kilometres driven.
Hmmm.. my Renault Clio looks nice
but after 50.000 km I only have 46%
of the original value left…
SPLINE REGRESSION MODELING NON LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures
to fit multivariate regression splines
Supports:
More than one input
Linear, logistic, Poisson, GLM regressions
Combines both regression splines and model selection methods
Supports partitioning of data into training, validation, and testing roles
DECISION TREES
DECISION TREES
How does it work? A simple example
Suppose we have the following group of people: 50% response, 50% no response
We know Age and Marital Status
[Figure: tree diagram. The root node (50% / 50%) is split on Age (≤ 45 versus > 45) into nodes with 30% / 70% and 60% / 40%; a further split on marital status (Married versus Divorced / Unmarried) gives nodes with 20% / 80% and 60% / 40%.]
DECISION TREES REGRESSION & CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 1.2 X
N 21 B 456 1.5 X
Y 32 A 545 1.3 U
Y 34 C 443 1.1 U
N 23 A 345 1.7 U
N 13 B 567 1.2 X
N 45 A 654 1.9 X
… … … … … …
… … … … … …
Y 46 A 657 2.1 X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply steps 1, 2 and 3 again
5. Stop somewhere
• How to split X1 or X2 ?
• When to stop?
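Steps 1 to 3 of the recursive algorithm can be sketched for a regression tree as follows. This is an illustration under simplifying assumptions (numeric inputs only, binary splits, squared error as the criterion); the data and the names `best_split` and `best_input` are hypothetical.

```python
def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Step 2: best binary split x < threshold for one input, by total SSE."""
    best_threshold, best_error = None, float("inf")
    distinct = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


def best_input(columns, ys):
    """Steps 1 and 3: loop through all inputs and keep the best one."""
    results = []
    for name, xs in columns.items():
        threshold, error = best_split(xs, ys)
        results.append((error, name, threshold))
    error, name, threshold = min(results)
    return name, threshold, error


# Hypothetical data: X1 separates the target perfectly, X2 does not.
columns = {
    "X1": [1, 2, 3, 10, 11, 12],
    "X2": [3, 1, 2, 3, 1, 2],
}
target = [5.0, 5.0, 5.0, 9.0, 9.0, 9.0]

print(best_input(columns, target))
```

Step 4 would call `best_input` again on the rows left and right of the chosen threshold; step 5 needs a stopping rule such as a minimum node size.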
DECISION TREES REGRESSION & CLASSIFICATION
How to split? The number of branches per split is usually 2 or 3.
More branches will exhaust the data too fast
Why is the split X1 < t1 better than X1 < s1?
Regression: mean squared error
Classification: misclassification rate, cross-entropy, chi-squared
Regression tree: mean squared error
[Figure: scatter plots of Y versus x with candidate splits s1 and t1]
Classification tree: misclassification rate
[Figure: points along x with candidate splits s1 and t1]
DECISION TREES REGRESSION & CLASSIFICATION
When to stop?
Not too early, not too late!
Pruning
Remove parts of the tree
DECISION TREES SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5 / C5.0
CART (Classification and Regression Trees)
They differ mainly in their splitting criteria
DECISION TREES PROS AND CONS
Pros
Interaction between variables
Interpretable rules
Missing values are easy to incorporate
Cons
Unstable
"Lack of smoothness": poor fit of obvious (non)linear relations
[Figures: example tree with splits on male / female, Income < 45K and Age < 33, showing response rates; tree fit on the Opel Astra data]
DIMENSION REDUCTION
PRINCIPAL COMPONENTS ANALYSIS
Linear transformation of the data into uncorrelated data
The transformation W is such that
The largest variance is in the first coordinate
The second largest variance is in the second coordinate
Etc.
[Figure: scatter plot of data points in the (X1, X2) plane]
PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND
In general: P = X W
With two dimensions:
$$\begin{pmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{pmatrix} = \begin{pmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{pmatrix}$$
w11 and w12 are the loadings corresponding to the first principal component.
w21 and w22 are the loadings corresponding to the second principal component.
It turns out that the columns of W
are the eigenvectors of the matrix $X^T X$
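For two inputs the eigenvectors can be written down in closed form, which makes the relation P = X W easy to check by hand. The sketch below is an illustration only: it assumes that X is centered first (the slide does not say so explicitly), and the function name `pca_2d` and the data are hypothetical.

```python
import math


def pca_2d(points):
    """PCA for 2-D data via the eigenvectors of X^T X, with X centered."""
    n = len(points)
    mean_1 = sum(p[0] for p in points) / n
    mean_2 = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_1, p[1] - mean_2) for p in points]

    # X^T X is the symmetric 2 x 2 matrix [[a, b], [b, c]].
    a = sum(x1 * x1 for x1, _ in centered)
    b = sum(x1 * x2 for x1, x2 in centered)
    c = sum(x2 * x2 for _, x2 in centered)

    # Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]

    if b == 0:
        # Inputs already uncorrelated: the components are the axes themselves.
        loadings = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        loadings = []
        for lam in eigenvalues:
            v = (b, lam - a)  # solves (X^T X - lam * I) v = 0
            norm = math.hypot(v[0], v[1])
            loadings.append((v[0] / norm, v[1] / norm))

    # P = X W: each row of P holds the scores on the two components.
    scores = [tuple(x1 * w[0] + x2 * w[1] for w in loadings) for x1, x2 in centered]
    return eigenvalues, loadings, scores


# Hypothetical data on the line X2 = X1: all variance lies in one direction.
eigenvalues, loadings, scores = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(eigenvalues, loadings[0])
```

Because the points lie on a straight line, the second coordinate of every score is zero: one principal component carries all the variance.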
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here
Applications of PCA
Dimension reduction
Visualisation
Outlier / anomaly detection
PCA regression
Use the principal components instead of the original inputs
PRINCIPAL COMPONENTS DIMENSION REDUCTION
P = X W
Now only take the first L columns of W:
P_L = X W_L
For example, for visualization use only the first
2 or 3 columns, so that P_L has only 2 or 3
columns that can be visualized in scatter or
contour plots
Example dimensions:
P (10000 by 100) = X (10000 by 100) W (100 by 100)
P_L (10000 by 2) = X (10000 by 100) W_L (100 by 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ V^T
Σ is diagonal with r singular values [r could be a large number]
SINGULAR VALUE DECOMPOSITION
Take only k << r singular values: A_k = U_k Σ_k V_k^T
A data point d can now be represented by a k-dimensional point
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 ~ 8 mln numbers
SVD with the 15 largest singular values: 1% of the data
SVD with the 75 largest singular values: 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection
I have 500 inputs, but maybe there are only ten clusters of inputs
Within one cluster the variables are (strongly) correlated.
Then use only one input per cluster for predictive modeling
All inputs: X1, X2, X3, ..., X500
Cluster X1, X21, X35, X430, ...: use X35
Cluster X17, X29, X353, X490, ...: use X29
Cluster X37, X95, X251, X393, ...: use X251
BAGGING & BOOSTING
COMBINE MODELS BAGGING & BOOSTING
If one model is not good enough: let multiple models vote for a prediction
Bootstrap Aggregation (Bagging)
This only makes sense if the underlying models are different enough and have some predictive power
[Figure: random samples are drawn from the data, a model is fitted on each sample, and the models are combined into a final model]
Bagging & Boosting: Random Forests
Random forests ≈ Bagging with trees
Apply the following steps repeatedly
1. Generate a bootstrap sample
2. Randomly choose m of the P inputs, with m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree:
The random forest prediction is the majority vote of all trees
In case of a regression tree:
The random forest prediction is the average of all trees
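The three steps and the majority vote can be sketched in plain Python. To keep the example short, each "tree" here is a one-split stump rather than a full unpruned tree, which is a deliberate simplification; the data, the seed and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """One-split tree on the given inputs, chosen by misclassification count."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not right:
                continue  # largest value in the sample: not a real split
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:
        # Degenerate bootstrap sample: always predict its majority class.
        label = majority(labels)
        return (features[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]   # 1. bootstrap sample
        features = rng.sample(range(p), m)              # 2. m random inputs
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample],
                                features))              # 3. fit a tree
    return forest


def predict_forest(forest, row):
    """Classification: majority vote over all trees."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical data with P = 2 inputs and a Y / N target.
rows = [(1, 10), (2, 11), (3, 12), (4, 13), (6, 20), (7, 21), (8, 22), (9, 23)]
labels = ["N", "N", "N", "N", "Y", "Y", "Y", "Y"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 9)), predict_forest(forest, (10, 24)))
```

For a regression target the vote in `predict_forest` would be replaced by the average of the tree predictions.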
FOREST VS TREE EXAMPLE ON SIMULATED DATA
Decision tree and Random forest (100
sub trees) fitted on the simulated data
The forest clearly produces much smoother predictions.
GRADIENT BOOSTING DON’T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING SCHEMATIC OVERVIEW
Gradient boosting with M iterations, m = 1, 2, ..., M
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals,
using the inputs x, to "correct" the previous learner.
Pseudo-residuals r_im at each step: r_1, r_2, ..., r_M
F_m = F_{m-1} + γ·h_m
Final model: F_M
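The update F_m = F_{m-1} + γ·h_m can be followed step by step in a small sketch. It assumes squared-error loss, for which the pseudo-residuals are simply actual minus current prediction, and uses one-split regression stumps as base learners; the data, γ = 0.5 and the function names are hypothetical choices.

```python
def fit_stump(xs, targets):
    """Regression stump: best split x <= t by squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def stump_value(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_iter=50, gamma=0.5):
    """F_0 is the mean; then F_m = F_{m-1} + gamma * h_m for m = 1..M."""
    f0 = sum(ys) / len(ys)
    stumps = []
    predictions = [f0] * len(xs)
    for _ in range(n_iter):
        # Pseudo-residuals for squared-error loss: actual minus current fit.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + gamma * stump_value(stump, x)
                       for x, p in zip(xs, predictions)]
    return f0, stumps, gamma


def predict(model, x):
    f0, stumps, gamma = model
    return f0 + gamma * sum(stump_value(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-input data with three levels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0]

model = gradient_boost(xs, ys)
baseline_mse = mse((sum(ys) / len(ys), [], 0.5), xs, ys)  # constant model F_0
boosted_mse = mse(model, xs, ys)
print(baseline_mse, boosted_mse)
```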
SUPPORT VECTOR MACHINES
Support vector machines (SVM)
Suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with
maximum margin M. So the green line would be better than the blue line.
If the groups are not separable, you have to allow some points to be on the
wrong side. These points are penalized. SVM still maximizes the
margin M, but with the constraint that the total penalty is smaller than
C.
The relation with the inputs might not be linear. We could apply nonlinear
mappings to the inputs, e.g. x², x³ or spline(x).
The beauty of SVM is that in the calculation of the decision
boundary we do not need to use these transformations explicitly:
“The kernel trick”
SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
Separable classification
Non Separable classification
Non Separable classification rewritten using
Lagrange Dual problem
Kernels to model nonlinear behaviour
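The formulas for these problems are not part of the extracted slide text. As a sketch, assuming the standard formulation from The Elements of Statistical Learning (labels $y_i \in \{-1, 1\}$, slack variables $\xi_i$ for points on the wrong side, and a cost parameter written here as $\gamma$), they read:

```latex
% Separable classification: maximize the margin M
\max_{\beta,\, \beta_0,\, \lVert \beta \rVert = 1} M
\quad \text{subject to} \quad y_i \,(x_i^T \beta + \beta_0) \ge M, \qquad i = 1, \dots, N

% Non-separable classification: allow slack, bound the total penalty by C
\max_{\beta,\, \beta_0,\, \lVert \beta \rVert = 1} M
\quad \text{subject to} \quad y_i \,(x_i^T \beta + \beta_0) \ge M (1 - \xi_i), \quad
\xi_i \ge 0, \quad \sum_{i=1}^{N} \xi_i \le C

% Lagrange dual of the equivalent form  min (1/2)||beta||^2 + gamma * sum_i xi_i
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
- \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, x_i^T x_j
\quad \text{subject to} \quad 0 \le \alpha_i \le \gamma, \quad \sum_{i=1}^{N} \alpha_i y_i = 0

% Kernel trick: replace the inner product by a kernel
x_i^T x_j \;\longrightarrow\; K(x_i, x_j)
```

The dual depends on the inputs only through inner products, which is why a kernel K can stand in for a nonlinear mapping without the mapping ever being computed.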
https://www.youtube.com/watch?v=3liCbRZPrZA
The points are not linearly separable, but in 3D space they are!
K – NEAREST NEIGHBOUR
K-NN METHOD
• No model is fitted. Given a query point x0 , find the k points x1, x2,..., xk that are
closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours
The 5 nearest neighbours of x0:
3 of them are red
2 of them are green
so we predict x0 to be red
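The majority vote is only a few lines of code. The sketch below uses Euclidean distance and hypothetical points arranged so that, as in the example above, 3 of the 5 nearest neighbours of x0 are red and 2 are green.

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k=5):
    """Classify query by majority vote among its k nearest points."""
    by_distance = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training points around the query point x0 = (0, 0).
points = [(1, 0), (0, 1), (-1, 0),        # three red points at distance 1
          (0, -1.5), (1.5, 1.5),          # two green points a bit further away
          (5, 5), (6, 5), (5, 6)]         # three green points far away
labels = ["red", "red", "red", "green", "green", "green", "green", "green"]

x0 = (0, 0)
print(knn_classify(x0, points, labels, k=5))   # 3 red versus 2 green
print(knn_classify(x0, points, labels, k=8))   # all points vote
```

The second call shows why the choice of k matters: with k = 8 every point votes and the more numerous green class wins.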
K-NN METHOD
[Figures: 1 nearest neighbour versus 15 nearest neighbours]
[Figure: test and training errors for different numbers k of nearest neighbours]
Despite its simplicity, k-nearest-neighbours has been
used successfully in problems such as
• Handwritten digits
• Satellite image scenes
• EKG patterns
K-NN EXAMPLE DUTCH HOUSE PRICES
Asking prices of houses for sale were extracted from a Dutch housing site
For 108K Dutch postal codes (out of 463K) there are one or more houses for sale.
How can we estimate the house value for the postal codes without a house price?
For a postal code without a price, estimate the price
from the prices of the k closest houses for sale.
Comparing different numbers of nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE DUTCH HOUSE PRICES
30% of the data was used as validation set
In Enterprise Miner different values for k were tried
k = 5 nearest neighbours has the lowest average squared error
NEURAL NETWORKS
DEEP LEARNING
NEURAL NETWORK LINEAR REGRESSION
[Figure: a neural network compute node with inputs 1, X2, X3, X4, weights w1, w2, w3, w4 and output Y]
Y = f(X, w) = f(w1 + w2·X2 + w3·X3 + w4·X4)
f is the so-called activation function.
This could be the logistic function, but
other choices are possible; with the identity function the node is a linear regression
There are four weights w that have
to be determined
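NEURAL NETWORKS MATHEMATICAL FORMULATION
[Figure: network with inputs X1 (Age), X2 (Income), X3 (Region), X4 (Gender), a hidden layer Z1, Z2, Z3 and outputs Y and N; weights α between inputs and hidden layer, weights β between hidden layer and outputs]
The prediction formula for a neural network is given by
$$P(Y \mid X) = g(T_Y), \qquad T_Y = \beta_{0Y} + \beta_Y^T Z, \qquad Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$$
The functions g and σ are defined as
$$g(T_Y) = \frac{e^{T_Y}}{e^{T_N} + e^{T_Y}}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
In the case of a binary classifier, $P(N \mid X) = 1 - P(Y \mid X)$
The model weights α and β have to be estimated from the data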
NEURAL NETWORKS ESTIMATING THE WEIGHTS
Back propagation algorithm
Randomly choose small values for all weights w_i
For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example: E = (actual − prediction)²)
3. Adjust the weights w according to: $w_i^{new} = w_i + \Delta w_i$ with $\Delta w_i = -\alpha \, \partial E / \partial w_i$, where α is the learning rate
4. Stop if the error E is small enough.
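The weight update can be tried out on the smallest possible network: a single node with identity activation, prediction = w0 + w1·x. This sketch is not full back propagation through hidden layers; it only illustrates steps 1 to 3 for each observation. The data, the learning rate and the fixed starting weights are hypothetical (the starting values are fixed rather than random so that the run is reproducible).

```python
def train(data, alpha=0.05, epochs=500):
    """Per-observation gradient descent for prediction = w0 + w1 * x."""
    w0, w1 = 0.01, -0.01  # small starting values
    for _ in range(epochs):
        for x, actual in data:
            prediction = w0 + w1 * x          # 1. prediction
            error = actual - prediction       # 2. E = error ** 2
            # 3. dE/dw0 = -2 * error and dE/dw1 = -2 * error * x,
            #    so delta_w = -alpha * dE/dw
            w0 += -alpha * (-2 * error)
            w1 += -alpha * (-2 * error * x)
    return w0, w1


# Hypothetical observations lying exactly on y = 1 + 2 * x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w0, w1 = train(data)
print(round(w0, 4), round(w1, 4))
```

With a learning rate that is too large the updates overshoot and the weights diverge, which is one reason why inputs are usually standardized first.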
DEEP LEARNING NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs
[Figure: inputs X1, ..., X4 are encoded into a 2-dimensional middle layer and decoded back into X1, ..., X4]
A 2-dimensional middle layer can be used for visualisation
With linear activation functions this corresponds to 2-dimensional principal components analysis
Often more hidden layers with many nodes are used
[Figure: ENCODE and DECODE stages; INPUT, OUTPUT = INPUT]
NEURAL NET CARS EXAMPLE
[Figures: 2-dimensional PCA versus an autoencoder network with layers 25 – 15 – 2 – 15 – 25]
NEURAL NETS AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400 dimensional input vector X = (x1,…,x400)
• Compare two-dimensional PCA with a neural net autoencoder
proc neural
    data= autoencoderTraining
    dmdbcat= work.autoencoderTrainingCat;
    performance compile details cpucount= 12 threads= yes;
    /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
    /* IDS ARE USED AS LAYER INDICATORS – SEE FIGURE 6 */
    /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
    archi MLP hidden= 5;
    hidden 300 / id= h1;
    hidden 100 / id= h2;
    hidden 2 / id= h3 act= linear;
    hidden 100 / id= h4;
    hidden 300 / id= h5;
    input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
    target pixel1-pixel400 / act= identity id= t level= int std= std;
    /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
    initial random= 123;
    prelim 10 preiter= 10;
run;
Two-dimensional representation of the 400-dimensional ‘digit’ data
BAYESIAN NETWORKS
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
TEXT MINING
TEXT MINING BASICS
“Advanced” word counting
Parse & filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks, stop list, synonym list
• Multi-term words
Apply traditional data mining:
• Clustering
• Prediction / machine learning
TEXT MINING BASICS
Document 1: “I walk down the street in Amsterdam, 1057DK, with my bike”
Document 2: “She did not walk but cycled with her blue bikee, //bitly.com/sdrtw”
Document 3: “My two-wheeler is broken, what a bad piece of iron, @#$%$@!”
TERM DOCUMENT MATRIX: A
Terms Doc 1 Doc 2 Doc 3
+Bike (noun) 1 1 1
Cycle (verb) 0 1 0
Blue (adjective) 0 1 0
Amsterdam (location) 1 0 0
+Walk (verb) 1 1 0
Street (noun) 1 0 0
Broken (adverb) 0 0 1
Bad 0 0 1
Piece of iron 0 0 1
1057DK (postal code) 1 0 0
//bitly.com/sdrtw (Internet) 0 1 0
• Each text document is a (very) long vector
of word counts (often with many zeros!)
• Apply further mining on this matrix A.
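A bare-bones version of the word counting step is sketched below. It only lowercases, splits on whitespace and drops stop words; it does none of the stemming, spell checking, synonym or entity handling listed on the previous slide, so "two-wheeler" is not mapped to "bike" here. The documents and the stop list are hypothetical.

```python
from collections import Counter

# Hypothetical, already cleaned documents.
documents = {
    "Doc 1": "I walk down the street in Amsterdam with my bike",
    "Doc 2": "She did not walk but cycled with her blue bike",
    "Doc 3": "My two-wheeler is broken what a bad piece of iron",
}

# Hypothetical stop list.
STOP_WORDS = {"i", "the", "in", "with", "my", "she", "did", "not", "but",
              "her", "is", "what", "a", "of", "down"}


def term_document_matrix(docs, stop_words):
    """Map each term to its list of counts, one count per document."""
    counts = {
        name: Counter(word for word in text.lower().split()
                      if word not in stop_words)
        for name, text in docs.items()
    }
    terms = sorted(set().union(*counts.values()))
    return {term: [counts[name][term] for name in docs] for term in terms}


matrix = term_document_matrix(documents, STOP_WORDS)
for term, row in matrix.items():
    print(f"{term:12s} {row}")
```

Each row of `matrix` is one term and each column one document, as in the term document matrix A above; most entries are zero even in this tiny example.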
TEXT MINING TERM DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term document
matrix
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse
Apply Singular value decomposition first.
TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ V^T
Σ is diagonal with r singular values [r could be many thousands]
Take only the first k << r singular values: A_k = U_k Σ_k V_k^T
A document d is then no longer a long vector of m word counts but a much shorter vector,
say of length 300.
TEXT MINING APPLICATIONS
Combine structured and unstructured customer data to better predict behaviour (churn / fraud)
Apply machine learning to create
a model f to predict the target
Automatically generate topics within large document collections
Apply clustering techniques to group
documents into clusters (topics): Topic 1, Topic 2, Topic 3
RECOMMENDATION ENGINE
Which product should I recommend to my customers?
RECOMMENDATION ENGINE USER – ITEM MATRIX EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly
Matrix is often very sparse
1 mln users, 100K items: ~ 0.01% filled??
User - Item Matrix – Data
        Item 1  Item 2  Item 3  Item 4  Item 5
User 1  3       2       5       4       5
User 2  -       -       -       1       1
User 3  1       -       2       5       -
User 4  -       -       1       2       5
User 5  2       1       4       2       3
User 6  2       3       -       5       1
User 7  5       1       -       3       4
User 8  -       1       -       4       1
User 9  2       3       2       4       2
User 10 -       1       3       -       1
User 4's item ratings:
User 4  -       -       1       2       5
After some math... the recommendations are:
User 4  3.21    4.82    1       2       5
Recommend item 2!
RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms
Slope one (slope1)
K nearest neighbors (knn)
Model-based algorithms
Matrix factorization (SVD - LBFGS)
Market basket analysis
Association rules mining (arm)
Mixture of different methods
Clustering (cluster)
Ensemble
RE METHODS SLOPE ONE
Y = x + b with slope equal to 1;
See notes
Item-item based
$$\hat{r}_{ui} = \frac{\sum_j w_{ij} \, r_{uj}}{\sum_j w_{ij}}$$
Weight w_ij: the number of users having rated both items i and j;
Rating r_uj: the average rating computed from item j;
Sample rating database
Customer Item A Item B Item C
John     5      3      2
Mark     3      4      ??
Lucy     ??     2      5
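The two missing ratings in the sample database can be filled in with a short sketch. It assumes the usual weighted Slope One scheme, in which the rating "computed from item j" is the user's rating of j plus the average difference between items i and j (the b in Y = x + b); the slide refers to notes that are not part of this text, so that reading is an assumption, and the function names are illustrative.

```python
# Sample rating database from the slide; missing ratings are simply absent.
ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}


def deviation(ratings, i, j):
    """Average of (rating of i - rating of j) over users who rated both,
    together with the number of such users (the weight w_ij)."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    if not diffs:
        return 0.0, 0
    return sum(diffs) / len(diffs), len(diffs)


def predict(ratings, user, item):
    """Weighted Slope One prediction of the user's rating for item."""
    numerator = denominator = 0.0
    for j, r_uj in ratings[user].items():
        if j == item:
            continue
        dev, weight = deviation(ratings, item, j)
        numerator += weight * (r_uj + dev)   # rating of i computed from item j
        denominator += weight
    return numerator / denominator if denominator else None


print(round(predict(ratings, "Lucy", "A"), 2))
print(round(predict(ratings, "Mark", "C"), 2))
```

For Lucy and item A, two users rated both A and B (average difference 0.5) and one user rated both A and C (difference 3), giving (2 × 2.5 + 1 × 8) / 3.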
RE METHODS K NEAREST NEIGHBORS
The rating r_ui is determined by the ratings “in the neighborhood”
$$\hat{r}_{ui} = \frac{\sum_{j \in N(i;u)} sim_{ij} \, r_{uj}}{\sum_{j \in N(i;u)} sim_{ij}}$$
How to determine the neighbors N and how many (k) to use?
How to compute the similarity/distance measure sim_ij (the weight w_ij):
• Pearson’s correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS PEARSON CORRELATION
a, b: users
r_{a,p}: rating of user a for item p
\bar{r}_a: average rating of user a
P: set of items rated by both a and b
• Possible similarity values lie between −1 and 1
$$sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \; \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}$$
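The similarity can be computed directly from two users' rating dictionaries. In this sketch the user averages are taken over the co-rated items in P, which is one common convention and an assumption here; the ratings and the function name are hypothetical.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation between two users over their co-rated items."""
    common = [p for p in ratings_a if p in ratings_b]
    if len(common) < 2:
        return 0.0  # not enough overlap to measure similarity
    mean_a = sum(ratings_a[p] for p in common) / len(common)
    mean_b = sum(ratings_b[p] for p in common) / len(common)
    numerator = sum((ratings_a[p] - mean_a) * (ratings_b[p] - mean_b)
                    for p in common)
    denominator = (
        math.sqrt(sum((ratings_a[p] - mean_a) ** 2 for p in common))
        * math.sqrt(sum((ratings_b[p] - mean_b) ** 2 for p in common))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical users: b rates like a (on a different scale), c rates the opposite way.
user_a = {"item1": 1, "item2": 2, "item3": 3, "item4": 5}
user_b = {"item1": 2, "item2": 4, "item3": 6}
user_c = {"item1": 3, "item2": 2, "item3": 1}

print(pearson_similarity(user_a, user_b))
print(pearson_similarity(user_a, user_c))
```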
RE METHODS MATRIX FACTORIZATION
How do we fill in the missing data?
R (m users by n items) = U (m by k) V (k by n)
Select a loss function (squared error)
Select the number of hidden factors k
Optimization problem, solved with L-BFGS or ALS
Predict a new rating: $\hat{R}_{ij} = U_i^T V_j$
Minimize the prediction error:
$$\min_{U, V} \sum_{i,j} \Bigl[ (R_{ij} - U_i^T V_j)^2 + \lambda \bigl( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \bigr) \Bigr]$$
RE METHODS CLUSTER
[Figure: user/item profiles and user/item ratings are clustered; predictions are then made with k-NN within one subgroup]
RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rule mining
Identify frequent itemsets (rules) in the transaction data:
IF item A and item B THEN item C
IF item X THEN item Y
Not all rules are interesting; use ‘support’ and ‘lift’ to judge the importance of a rule
Support(X, Y) = (# transactions containing both X and Y) / (total # transactions)
Lift = Support(X, Y) / (Support(X) × Support(Y))
Support & lift example:
Diapers, Beer: 0.8%
Diapers, Candles: 0.018%
For example, a lift of 2.5 means:
people who have X are 2.5 times as likely
to buy Y as customers in general
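Support and lift follow directly from counting transactions. The sketch below uses ten hypothetical baskets; the item names echo the example above but the numbers are made up for illustration.

```python
# Ten hypothetical transactions (market baskets).
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer"},
    {"diapers", "beer", "candles"},
    {"diapers"},
    {"beer"},
    {"beer"},
    {"candles"},
    {"candles"},
    {"milk"},
    {"milk"},
]


def support(transactions, *items):
    """Fraction of transactions that contain all the given items."""
    wanted = set(items)
    return sum(wanted <= basket for basket in transactions) / len(transactions)


def lift(transactions, x, y):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    return support(transactions, x, y) / (
        support(transactions, x) * support(transactions, y)
    )


print(support(transactions, "diapers", "beer"))          # 3 of 10 baskets
print(round(lift(transactions, "diapers", "beer"), 2))   # 0.3 / (0.4 * 0.5)
```

A lift above 1 means the two items occur together more often than expected if they were bought independently.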
C op yr i g h t © 2012 , SAS Ins t i t u te Inc . A l l r i g h ts r eser v ed .
METHOD ENSEMBLE
Linear combination of previous methods
Achieve better performance
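A minimal sketch of a linear combination of two methods, assuming a convex blend (weights w and 1 − w) whose weight is picked by a grid search on held-out ratings. The predictions and names below are invented and deliberately constructed so that the two methods err in opposite directions; the slide does not say how the weights are chosen.

```python
def rmse(predicted, actual):
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5


def blend(pred_1, pred_2, w):
    """Linear combination w * pred_1 + (1 - w) * pred_2."""
    return [w * a + (1 - w) * b for a, b in zip(pred_1, pred_2)]


def best_weight(pred_1, pred_2, actual, steps=100):
    """Grid search over w in [0, 1] for the lowest validation RMSE."""
    candidates = [s / steps for s in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(pred_1, pred_2, w), actual))


# Hypothetical held-out ratings and the predictions of two methods
actual = [8.0, 6.0, 9.0, 7.0]
svd_pred = [7.0, 6.5, 9.5, 6.0]
knn_pred = [9.0, 5.5, 8.5, 8.0]

w = best_weight(svd_pred, knn_pred, actual)
ensemble_rmse = rmse(blend(svd_pred, knn_pred, w), actual)
print(w, ensemble_rmse, rmse(svd_pred, actual), rmse(knn_pred, actual))
```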
PROC RECOMMEND recom = rs.IENS;
* Add a recommendation system;
ADD rs.IENS /item = item user = user rating = rating;
* Add tables;
ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars=(item user rating);
* Method SVD LBFGS with 20 factors ;
METHOD svd /
factors = 20
label = "svd" fconv = 1e-3
gconv = 1e-3 maxiter = 100
MAXFEVAL = 5000 function = L2
lamda = 0.2
technique = lbfgs;
RUN;
METHOD ARM /
label = "ARM" ;
RUN;
/* information on the recommender system */
INFO;
QUIT;
/** prediction with the SVD method ***/
PROC RECOMMEND recom = rs.IENS;
PREDICT /
method = svd
label = "svd"
Num = 3
users = ("Longhow Lam");
run;
QUIT;
LAST SLIDE
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)
CONS
Unfamiliar to a broader audience, (more) difficult to explain
Black box approach (you are rejected: The computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS
Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
Interactions often “automatically” taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent “for free”)
It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining
Image recognition
Sound recognition
Strange faces
So can a machine read, see and hear?
PREDICTING SENTIMENT FROM
RESTAURANT REVIEWS
IENS REVIEWS COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used text miner to parse and filter reviews,
and transform reviews to data points in SVD space.
USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS APPLY MODEL ON ‘NEW REVIEWS’
MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
MNIST TRAINING DATA
42,000 pictures of hand-written digits
Each digit is a picture of 28 by 28 pixels
So each digit is a 784-dimensional vector
First 100 digits of the MNIST data and their known labels in red
MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES
8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
70/30 training/validation split
PCA regression on 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbors
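To make the evaluation procedure concrete, here is a small sketch of a 70/30 training/validation split with an 8-nearest-neighbour classifier and its misclassification rate. It runs on synthetic two-dimensional points rather than the 784-dimensional MNIST vectors, and all names and data are hypothetical; it illustrates the procedure, not the Enterprise Miner flow used for the slide.

```python
import math
import random
from collections import Counter


def knn_classify(train, point, k=8):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Synthetic stand-in for the digit data: two well-separated classes
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), 1) for _ in range(100)]
rng.shuffle(data)

# 70/30 training/validation split
split = len(data) * 7 // 10
train, valid = data[:split], data[split:]

# Misclassification rate on the validation set
errors = sum(knn_classify(train, x, k=8) != y for x, y in valid)
error_rate = errors / len(valid)
print(f"validation misclassification rate: {error_rate:.1%}")
```

The same loop can be repeated for k = 16 and k = 24 to compare candidates on the validation set, as the slide does.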
MNIST DATA APPLY MODEL ON TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION
DIGITS RECORDED WITH IPHONE
1 2
SPEECH RECOGNITION
WAV files consist of ~30,000 points: too much redundancy
Use spectral analysis to convert the signal to the frequency domain
Still too much: apply principal components
TRAIN DATA
8 spoken ‘ones’ in wav files
8 spoken ‘twos’ in wav files
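The spectral-analysis step can be illustrated with a plain discrete Fourier transform. This is only a sketch: a short synthetic signal made of two sine waves stands in for a real WAV recording, the function name is hypothetical, and the subsequent principal-components step is not shown.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each frequency bin from 0 up to the Nyquist bin."""
    n = len(signal)
    magnitudes = []
    for freq in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * freq * t / n)
                    for t, x in enumerate(signal))
        magnitudes.append(abs(coeff) / n)
    return magnitudes


# Synthetic stand-in for a recording: two tones at 5 and 12 cycles per window
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]

spectrum = dft_magnitudes(signal)
strongest = sorted(range(len(spectrum)), key=lambda f: spectrum[f], reverse=True)[:2]
print(strongest)
```

The 64 time-domain points collapse to two dominant frequency bins, which is the kind of redundancy reduction the slide is after before applying principal components.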
SPEECH RECOGNITION
Zero errors on training data
Zero errors on test data (also 8 ‘ones’ and 8 ‘twos’)
In Enterprise Miner:
Neural network with 9 neurons in one hidden layer
STRANGE FACE DETECTION
COMBO OF OPEN API / R & SAS
Little joke on my colleagues….
STRANGE FACE DETECTION
COMBO OF OPEN API / R & SAS
Get free API key for Face++
Their API returns 83 facial landmarks (in JSON format)
Create R script to
Retrieve the SAS faces from our site
Put them through the Face++ API
Collect JSON results and store them in an ABT
Apply advanced analytics on the ABT
Which faces are look-alikes? proc cluster (hierarchical clustering)
Sales faces? Predictive modeling / machine learning
Who is the Brad Pitt? Nearest neighbor
Strange faces? proc neural / auto-encoder
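The step of collecting JSON results into an ABT (one row per face, one column per landmark coordinate) might look roughly like this. The slides use an R script; this sketch uses Python instead, and the JSON layout, landmark names, and function name are assumptions made for illustration, not the documented Face++ response format.

```python
import json

# Hypothetical API response; the real Face++ layout and landmark names may differ
response_text = """
{"face": [{"landmark": {
    "nose_tip":          {"x": 50.1, "y": 60.2},
    "left_eye_center":   {"x": 35.4, "y": 40.0},
    "mouth_left_corner": {"x": 38.9, "y": 78.3}
}}]}
"""


def landmarks_to_row(person, response_text):
    """Flatten the landmarks of the first detected face into one ABT row."""
    response = json.loads(response_text)
    landmarks = response["face"][0]["landmark"]
    row = {"person": person}
    for name in sorted(landmarks):
        row[name + "_x"] = landmarks[name]["x"]
        row[name + "_y"] = landmarks[name]["y"]
    return row


# One row per face; with 83 landmarks each row would hold 166 coordinates
abt = [landmarks_to_row("colleague_1", response_text)]
print(abt[0])
```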
STRANGE FACE DETECTION
LOOK-ALIKE FACES
STRANGE FACE DETECTION
BRAD PITT LOOK-ALIKES
STRANGE FACE DETECTION
STRANGE FACES
SAS Faces, Actors Faces
Read more on my blog
STRANGE FACE DETECTION
COMBO OF OPEN API / R & SAS
SAS Faces, Actors Faces
Read more on my blog