Machine Learning Algorithms Summary + R Code

Supervised Learning Algorithms
1 Supervised Learning by Empirical Risk Minimization (ERM)
  1.1 Empirical Risk Minimization and Inductive Bias
  1.2 Ordinary Least Squares (OLS)
  1.3 Ridge Regression
  1.4 LASSO
  1.5 Logistic Regression
  1.6 Regression Classifier
  1.7 Linear Support Vector Machines (SVM)
  1.8 Generalized Additive Models (GAMs)
  1.9 Projection Pursuit Regression (PPR)
  1.10 Neural Networks (NNETs)
  1.11 Classification and Regression Trees (CARTs)
  1.12 Random Forests
  1.13 Rotation Forest
  1.14 Smoothing Splines
2 Non-ERM Supervised Learning
  2.1 k-Nearest Neighbour (KNN)
  2.2 Kernel Regression
  2.3 Local Likelihood and Local ERM
  2.4 Boosting
  2.5 Learning Vector Quantizations (LVQ)
3 Dimensionality Reduction in Supervised Learning
  3.1 Variable Selection
  3.2 LASSO
  3.3 Principal Component Regression (PCAR)
  3.4 Partial Least Squares (PLS)
  3.5 Canonical Correlation Analysis (CCA)
  3.6 Reduced Rank Regression (RRR)
4 Generative Models in Supervised Learning
  4.1 Fisher's Linear Discriminant Analysis (LDA)
  4.2 Fisher's Quadratic Discriminant Analysis (QDA)
  4.3 Naive Bayes
5 Ensembles
  5.1 Committee Methods
  5.2 Bayesian Model Averaging
  5.3 Stacking
  5.4 Bootstrap Averaging (Bagging)
  5.5 Boosting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Classification: Kernel Density Classification. The Naive Bayes classifier has the form of a generalized additive model, though the two are fit in quite different ways. Mixture models for density estimation and classification can be viewed as a kind of kernel method.
1.9 Projection Pursuit Regression (PPR)
Another way to generalize the hypothesis class F, which generalizes the GAM model, is to allow f to be some simple function of a linear combination of the predictors, of the form
$f(x) = \sum_{m=1}^{M} g_m(w_m' x)$,   (1.9)
where both the $g_m$ and the $w_m$ are learned from the data. The regularization is now performed by choosing M and the class of $\{g_m\}_{m=1}^{M}$.
Note: PPR is not a pure ERM. Just like the GAM problem, in the PPR problem the $\{g_m\}_{m=1}^{M}$ are learned by kernel regression. Solving the PPR problem is thus a hybrid of the ERM and kernel regression algorithms.
Note: If M is taken arbitrarily large, then for an appropriate choice of $g_m$ the PPR model can approximate any continuous function in $\mathbb{R}^p$ arbitrarily well. Such a class of models is called a universal approximator.
However, this generality comes at a price. Interpretation of the fitted model is usually difficult, because each input enters the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data. Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions $g_m(v)$, while the neural network uses a far simpler function based on the sigmoid $\sigma(v)$.
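A minimal sketch of fitting a PPR in R with stats::ppr(); the data and all settings here are illustrative, and nterms plays the role of M:

set.seed(1)
n <- 200
x1 <- runif(n); x2 <- runif(n)
y <- sin(2 * (x1 + x2)) + rnorm(n, sd = 0.1)  # signal along one projection
dat <- data.frame(x1, x2, y)
fit <- ppr(y ~ x1 + x2, data = dat, nterms = 1, max.terms = 3)
summary(fit)             # fitted projection directions and ridge terms
head(predict(fit, dat))  # fitted values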
1.10 Neural Networks (NNETs) - Single Hidden Layer
We
introduce the NNET model via the PPR model, and not through its historically original construction. In the language of Eq. (1.9), a single-layer feed-forward neural network is a model where the $\{g_m\}_{m=1}^{M}$ are not learned from the data, but rather assumed a-priori: $g_m(x) := \sigma(\alpha_m' x)$, where only the weights $\{\alpha_m\}_{m=1}^{M}$ are learned from the data. A typical activation function is the standard logistic CDF: $\sigma(t) = \frac{1}{1 + e^{-t}}$. As can be seen, the NNET is merely a non-linear regression model, the parameters of which are often called weights.
Loss Functions: Like any other ERM problem, we are free to choose the appropriate loss function.
Universal Approximator: Like the PPR, even when the $\{g_m\}_{m=1}^{M}$ are fixed beforehand, the class is still a universal approximator.
Regularization: regularization of the model is done via the selection of M, the number of nodes/variables in the network, and the number of layers.
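A minimal sketch of a single-hidden-layer network in R, assuming the nnet package is available; size is the number of hidden units (M) and decay is an $\ell_2$ penalty on the weights:

library(nnet)
set.seed(1)
fit <- nnet(Species ~ ., data = iris, size = 3, decay = 1e-3,
            maxit = 200, trace = FALSE)
table(predict(fit, iris, type = "class"), iris$Species)  # training confusion matrix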
1.11 Classification and Regression Trees (CARTs)
CARTs are a type of ERM where f(x) may include very non-smooth functions that can be interpreted as "if-then" rules, also known as decision trees. The hypothesis class of CARTs includes functions of the form $f(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}$. The parameters of the model are the different conditions $\{R_m\}_{m=1}^{M}$ and the function's value at each condition, $\{c_m\}_{m=1}^{M}$.
Regularization: is done by the choice of M, which is called the tree depth.
Loss Functions: As usual, a squared loss can be used for continuous outcomes y. For categorical outcomes, the loss function is called the impurity measure.
Impurity Measure: One can use either the misclassification error, the multinomial likelihood (known as the deviance, or cross-entropy), or a first-order approximation of the latter known as the Gini index.
Universal Approximator: CART is a universal approximator.
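A minimal sketch of fitting a classification tree in R, assuming the rpart package (one common CART implementation); maxdepth here is an illustrative regularization choice:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxdepth = 3))
print(fit)  # the fitted if-then rules
table(predict(fit, iris, type = "class"), iris$Species)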
1.12 Random Forests
Trees are very flexible hypothesis classes. They thus have small bias but large variance. Bagging trees will reduce this variance by averaging trees from different bootstrap samples. Alas, the variance (thus the MSE) of bagged trees is lower bounded by the fact that the trees use the same variables, and are thus correlated. To remedy this, [Breiman, 2001] proposed to fit trees to bootstrapped samples using only a random subset of the variables. This decorrelates the trees, thus allowing a reduction in their variance (thus their MSE).
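A minimal sketch with the randomForest package, assuming it is installed; mtry is the number of variables randomly sampled at each split:

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf$confusion  # out-of-bag confusion matrix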
1.14 Smoothing Splines

Unsupervised Learning
1 Introduction to Unsupervised Learning
2 Density Estimation
  2.1 Parametric Density Estimation
  2.2 Kernel Density Estimation
  2.3 Graphical Models
3 High Density Regions
  3.1 Association Rules
4 Linear-Space Embeddings
  4.1 Principal Components Analysis (PCA)
  4.2 Random Projections
  4.3 Sparse Principal Component Analysis (sPCA)
  4.4 Multidimensional Scaling (MDS)
  4.5 Local MDS
  4.6 Isometric Feature Mapping (Isomap)
5 Non-Linear-Space Embeddings
  5.1 Kernel Principal Component Analysis (kPCA)
  5.2 Self Organizing Maps (SOM)
  5.3 Principal Curves and Surfaces
  5.4 Local Linear Embedding (LLE)
  5.5 Auto Encoders
  5.6 Matrix Factorization
  5.7 Information Bottleneck
6 Latent Space Generative Models
  6.1 Factor Analysis (FA)
  6.2 Independent Component Analysis (ICA)
  6.3 Exploratory Projection Pursuit
  6.4 Compressed Sensing
  6.5 Generative Topographic Map (GTM)
  6.6 Finite Mixtures
  6.7 Hidden Markov Models (HMM)
  6.8 Latent Space Graphical Models
  6.9 Latent Dirichlet Allocation (LDA)
  6.10 Probabilistic Latent Semantic Indexing (PLSI)
  6.11 Prediction by Partial Matching (PPM)
  6.12 Dynamic Markov Compression (DMC)
7 Random Graph Models
  7.1 Erdos-Renyi
  7.2 Exchangeable Graph Model
  7.3 p1 Graph Model
  7.4 p2 Graph Model
  7.5 Stochastic Block Graph Model
  7.6 Latent Space Graph Model
  7.7 Exponential Random Graphs (ERGMs)
8 Cluster Analysis
  8.1 K-Means Clustering
  8.2 K-Medoids Clustering (PAM)
  8.3 Quality Threshold Clustering (QT)
  8.4 Hierarchical Clustering
  8.5 Fuzzy Clustering
  8.6 Self Organizing Maps (SOM)
  8.7 Spectral Clustering
  8.8 Bi-Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3.1 Association Rules (Market Basket Analysis; Apriori Algorithm)
Association rules, or market basket analysis, or affinity analysis,
can be seen as approximating the joint distribution with a
region-wise constant function.
Apriori Algorithm Terminology
The algorithm (using dummy 0/1 variables for "in basket"/"not in basket"): The first pass over the data computes the support (relative frequency) of all single-item sets. Those whose support
is less than the threshold are discarded. The second pass computes
the support of all item sets of size two that can be formed from
pairs of the single items surviving the first pass. In other words,
to generate all frequent itemsets with |K| = m, we need to consider
only candidates such that all of their m ancestral item sets of
size m − 1 are frequent. Those size-two item sets with support less
than the threshold are discarded. Each successive pass over the
data considers only those item sets that can be formed by combining
those that survived the previous pass with those retained from the
first pass. Passes over the data continue until all candidate rules
from the previous pass have support less than the specified
threshold. > Example: suppose the item set K = {peanut butter,
jelly, bread} andconsider the rule {peanut butter, jelly} =>
{bread}. A support value of 0.03 for this rule means that peanut
butter, jelly, and bread appeared together in 3% of the market
baskets. A confidence of 0.82 for this rule implies that when
peanut butter and jelly were purchased, 82% of the time bread was
also purchased. If bread appeared in 43% of all market basketsthen
the rule {peanut butter, jelly} => {bread} would have a lift of
1.95. The goal of this analysis is to produce association rules (A
=> B) with bothhigh values of support and confidence(A => B).
4 Linear Space Embedding Methods
Linear space embeddings are a class of dimensionality reduction techniques that map the data X into a lower dimensional linear space $\mathcal{M}$. The mapping itself, $f: X \to \mathcal{M}$, can be linear or nonlinear. We denote the low dimensional representation of the data by $\tilde{X} = f(X) \in \mathcal{M}$. The idea of ERM and Inductive Bias also applies to unsupervised learning: we seek some f that does not incur too much loss, on average, i.e., we seek to minimize R(f).
Remark: Two interpretations of "linear" can be found in the literature. It may refer to the nature of the low dimensional space approximating the data, or to the nature of the embedding operation.
4.1 PCA
PCA maximizes the variance of a linear combination of the features under a norm constraint, using Lagrange multipliers: $\max_v Var[v'X]$ subject to $\|v\| = 1$, where $Var[v'X] = v'\,Cov[X]\,v$. PCA is such a basic technique that it has been rediscovered and renamed independently in many fields. It can be found under the names of discrete Karhunen-Loeve Transform; Hotelling Transform; Proper Orthogonal Decomposition (POD); Eckart-Young Theorem; Schmidt-Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral Decomposition; Empirical Modal Analysis; and possibly more.
Example: Consider human height and weight data. While clearly two dimensional data, you don't really need both to understand how "big" the people in the data are. This is because height and weight vary mostly along a single dimension, which can be interpreted as the "bigness" of an individual. This is why physicians use the Body Mass Index (BMI) as an indicator of size, instead of a two-dimensional measurement. Assume now that you wish to give each individual a size score that is a linear combination of height and weight; PCA does just that. It returns the linear combination that has the most variability, i.e., the combination which best distinguishes between individuals. Notice we have currently offered two motivations for PCA: (i) find linear combinations that best distinguish between observations, i.e., maximize variance; (ii) find the linear subspace that best approximates the data. The reason these two problems are equivalent is the use of the squared error. Informally speaking, the data has some total variance, which can be decomposed into the part captured in $\mathcal{M}$ and the part not captured.
Note: Usually, for simplicity of exposition, we will assume that the data X has been mean centered.
Terminology:
Principal Components: The linear combinations of the features which best separate between observations; in our example, the "bigness" index of each individual. The first component captures the most variance, the second component the second most, etc. In terms of $\mathcal{M}$, the principal components are an orthogonal basis for $\mathcal{M}$.
Scores: Synonymous with Principal Components.
Loadings: The weights of each feature in each principal component; in our example, the importance of height and weight in constructing the "bigness" score.
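A minimal sketch of PCA in R with stats::prcomp() on simulated height/weight data (the data are illustrative):

set.seed(1)
height <- rnorm(100, mean = 170, sd = 10)
weight <- 0.9 * height - 90 + rnorm(100, sd = 5)  # correlated with height
pc <- prcomp(cbind(height, weight), center = TRUE, scale. = TRUE)
pc$rotation      # loadings of height and weight in each component
head(pc$x[, 1])  # first-component scores: the "bigness" of each individual
summary(pc)      # proportion of variance captured by each component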
PCA as a Graph Method
Starting from the maximal variance motivation, it is perhaps not surprising that PCA depends only on the similarities between features, as measured by their empirical covariance. The linearity of the target manifold was there by assumption. The building blocks of all these graph-based dimensionality reduction methods are:
1. Compute some similarity graph G (or dissimilarity graph D) from the raw features.
2. Call upon graph embedding theory to map the data points into the target manifold $\mathcal{M}$.
To summarize: Task = dim. reduction; Type = optimization; Input = graph (G); Output = embedding function.
4.3 Sparse Principal Component Analysis (sPCA)
When analyzing PCA results, we often wish to understand which features contribute to which component. This is much easier when the loadings (A) are sparse, i.e., include many zeroes. sPCA performs this in LASSO style, by means of $\ell_1$ regularization.
4.4 Multidimensional Scaling (MDS)
Both
self-organizing maps and principal curves and surfaces map data points in $\mathbb{R}^p$ to a lower dimensional manifold. Multidimensional scaling (MDS) has a similar goal, but approaches the problem in a somewhat different way: MDS represents high-dimensional data in a low-dimensional coordinate system. MDS requires only the dissimilarities $d_{ij}$, in contrast to the SOM and principal curves and surfaces, which need the data points $x_i$. MDS aims at representing a network (= a weighted graph) of distances (or similarities) between observations by embedding the observations in a q-dimensional linear subspace, while preserving the original distances.
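A minimal sketch of classical MDS in R with stats::cmdscale(), using the built-in eurodist road-distance data:

fit <- cmdscale(eurodist, k = 2)  # embed the cities in 2 dimensions
plot(fit[, 1], -fit[, 2], type = "n", xlab = "", ylab = "")
text(fit[, 1], -fit[, 2], labels = labels(eurodist), cex = 0.7)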
5 Non-Linear Space Embedding Methods
The fact that the linear-space embedding of the data depends only on some similarity graph has laid a bridge between feature embedding methods, such as PCA, and graph embedding methods, such as MDS. Moreover, it has opened the door to replacing the covariance similarity with many other similarity measures. Classic MDS is simply PCA when starting from G, thus viewed as a graph embedding problem. kPCA plugs in kernel similarities instead of covariance similarities. LocalMDS and LLE follow a similar motivation using local measures of similarity. The PCA
solution can be cast in terms of the covariance between individuals (G = X'X) or the Euclidean distances (D). In particular, we show that all the information on the location (mean) of X needed for the PCA reconstruction is actually encoded in G (or D).
5.1 Kernel Principal Component Analysis (kPCA)
The optimization problem is $\arg\max_{g} Cov[g(X)]$, where g(X) is the best separating score (function). We thus have two matters to attend to: (i) we need to constrain g(x) so that it does not overfit; (ii) we need the problem to be computable. This is precisely the goal of kPCA. We have already encountered a similar problem with Smoothing Splines, so it is not surprising that the solution has the same form: if we choose the right g's, the solution of the optimization problem takes a very simple form. The classes of such g's are known as Reproducing Kernel Hilbert Spaces (RKHS).
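A minimal sketch of kPCA in R, assuming the kernlab package; rbfdot is a Gaussian (RBF) kernel and the sigma value is an illustrative choice:

library(kernlab)
X <- as.matrix(iris[, 1:4])
kp <- kpca(X, kernel = "rbfdot", kpar = list(sigma = 0.2), features = 2)
head(rotated(kp))  # the data embedded on the first two kernel components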
Nonlinear Dimension Reduction and Local Multidimensional Scaling
These methods can be thought of as flattening the manifold, and hence reducing the data to a set of low-dimensional coordinates that represent their relative positions in the manifold. They are useful for problems where the signal-to-noise ratio is very high (e.g., physical systems), and are probably not as useful for observational data with lower signal-to-noise ratios.
Three Methods of Nonlinear MDS:
ISOMAP = Isometric feature mapping (Tenenbaum et al., 2000) - constructs a graph to approximate the geodesic distance between points along the manifold. Specifically, for each data point we find its neighbors: points within some small Euclidean distance of that point. We construct a graph with an edge between any two neighboring points. The geodesic distance between any two points is then approximated by the shortest path between the points on the graph. Finally, classical scaling is applied to the graph distances to produce a low-dimensional mapping.
LLE = Local linear embedding (Roweis and Saul, 2000) - takes a very different approach, trying to preserve the local affine structure of the high-dimensional data. Each data point is approximated by a linear combination of neighboring points. Then a lower dimensional representation is constructed that best preserves these local approximations. LLE aims at finding linear subspaces that are good approximations of small neighborhoods of the whole data X. It is similar in spirit to Isomap and LocalMDS (§5.4.5). It differs, however, in the way similarities are computed, and in the way embeddings are performed. In particular, as the name may suggest, LLE performs local embedding into linear subspaces. To summarize: Task = dim. reduction; Type = algorithm; Input = graph (G); Output = data embedding; Concept = local distance.
Local MDS (Chen and Buja, 2008) - takes the simplest
and arguably the most direct approach. We define N to be the
symmetric set of nearby pairs of points; specifically a pair (i,
i') is in N if point i is among the K-nearest neighbors of i', or
vice-versa.
5.2 Self Organizing Maps (SOM)
SOMs are a non-linear-subspace dimensionality reduction method aimed at good clustering. It is non-linear because the algorithm (which cannot be cast as an ERM, i.e., optimization, problem) returns an embedding into a non-linear manifold. To summarize: Task = dim. reduction; Type = algorithm; Input = X (data); Output = parametric curve or surface; Concept = self consistency, i.e., a curve with a path that is the average of all its closest data points.
Self Consistency: Roughly speaking, one can think of this curve as a parameterized function connecting all the k-means cluster centers in the smoothest way possible.
8 Cluster Analysis
Gaussian Mixtures as Soft K-means Clustering.
K-means Clustering - the algorithm is appropriate when the dissimilarity measure is taken to be squared Euclidean distance. This requires all of the variables to be of the quantitative type. In addition, using squared Euclidean distance places the highest influence on the largest distances, which causes the procedure to lack robustness against outliers that produce very large distances.
K-medoids Clustering - for a given cluster assignment C, find the observation in the cluster minimizing the total distance to the other points in that cluster. This algorithm assumes attribute data, but the approach can also be applied to data described only by proximity matrices. There is no need to explicitly compute cluster centers.
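A minimal sketch comparing K-means and K-medoids (PAM) in R; pam() assumes the cluster package is available:

library(cluster)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 20)  # squared Euclidean distance
pm <- pam(X, k = 3)                        # medoids are actual observations
table(km$cluster, iris$Species)
table(pm$clustering, iris$Species)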
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recommender Systems Algorithms
1. Content Filtering
2. Collaborative Filtering
3. Hybrid Filtering
4. Recommender Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The two
main approaches to recommender systems include content filtering
and collaborative filtering.
1. Content Filtering
In content filtering, the system is assumed to have some background information on the user (say, because they logged in), and uses this information to give them recommendations. The recommendation in this case is approached as a supervised learning problem: the system learns to predict a product's rating based on the user's features.
2. Collaborative Filtering
Unlike content filtering, in collaborative filtering there is no external information on the user or the products, besides the ratings of other users. Collaborative filtering can be approached as a supervised learning problem or as an unsupervised learning problem, but it is actually neither: it is essentially a missing data problem. The two main approaches to collaborative filtering include neighborhood methods and latent factor models.
a. The neighborhood methods for collaborative filtering rest on the assumption that similar individuals have similar tastes: if someone similar to individual i has seen movie j, then i should have a similar opinion.
b. The latent factor models approach to collaborative filtering rests on the assumption that the rankings are a function of some latent user attributes and latent movie attributes. This idea is not a new one, as we have seen it in the context of unsupervised learning in factor analysis (FA) and independent component analysis (ICA). This is why this approach is more commonly known as the Matrix Factorization approach to collaborative filtering. We can present several matrix factorization problems in the ERM framework.
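A minimal sketch of one such ERM formulation: rank-k matrix factorization with a ridge penalty, fit by alternating least squares on a toy ratings matrix with missing entries (all names and settings here are illustrative):

set.seed(1)
R <- matrix(NA, 5, 4)
R[sample(20, 12)] <- sample(1:5, 12, replace = TRUE)  # 12 observed ratings
k <- 2
U <- matrix(rnorm(5 * k, sd = 0.1), 5, k)  # latent user factors
V <- matrix(rnorm(4 * k, sd = 0.1), 4, k)  # latent item factors
lambda <- 0.1                              # ridge penalty
for (iter in 1:50) {
  for (i in 1:5) {  # update user factors given item factors
    j <- which(!is.na(R[i, ]))
    if (length(j)) U[i, ] <- solve(crossprod(V[j, , drop = FALSE]) +
      lambda * diag(k), crossprod(V[j, , drop = FALSE], R[i, j]))
  }
  for (j in 1:4) {  # update item factors given user factors
    i <- which(!is.na(R[, j]))
    if (length(i)) V[j, ] <- solve(crossprod(U[i, , drop = FALSE]) +
      lambda * diag(k), crossprod(U[i, , drop = FALSE], R[i, j]))
  }
}
R_hat <- U %*% t(V)  # imputed ratings, including the missing entries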
3. Hybrid Filtering
After introducing the ideas of content filtering and collaborative filtering, why not marry the two? Hybrid filtering is the idea of imputing the missing data, and thus making recommendations, using both a viewer's attributes and other viewers' preferences. It can be presented as an ERM problem.
Recommender Systems Terminology
Content Based Filtering: A supervised learning approach to recommendations.
Collaborative Filtering: A missing data imputation approach to recommendations.
Memory Based Filtering: A non-parametric (neighborhood) approach to collaborative filtering.
Model Based Filtering: A latent space generative model approach to collaborative filtering.
Misc notes:
========
The Relation
Between Supervised and Unsupervised Learning
It may be surprising that collaborative filtering can be seen as both an unsupervised and a supervised learning problem, but these are not mutually exclusive. Since in unsupervised learning we try to learn the joint distribution of x, i.e., the relationship between each variable in x and the rest, we may see it as several supervised learning problems, in each of which a different variable in x plays the role of y.
The Kernel Trick
Applies to: SVM, PCA,
canonical correlation analysis, ridge regression, spectral
clustering, Gaussian processes, and more (k-nearest neighbor (kNN)
is also a kernel method). Think of smoothing splines: it was quite magical that, without constraining the hypothesis class F, the ERM problem has a finite dimensional closed form solution. The property of an infinite dimensional problem having a solution in a finite dimensional space is known as the kernel property. The problem is then: what type of penalties J(f) will return simple solutions to the penalized ERM problem
$\arg\min_{f} \sum_{i} L(y_i, f(x_i)) + \lambda J(f)$?   (1)
The answer is: functions that belong to a Reproducing Kernel Hilbert Space (RKHS) function space.
The Bayesian View of RKHS
Just
as the ridge regression has a Bayesian interpretation, so does the
kernel trick. Informally, the functions solving Eq.(1) can be seen
as the posterior mode if our prior beliefs postulate that the
function we are trying to recover is a Gaussian zero-mean process
with covariance given by K.
Generative Models
By generative model we mean that we specify the whole data distribution. This is particularly relevant to supervised learning, where many methods only assume the distribution of P(y|x) without stating the distribution of P(x). LDA, QDA, and Naive Bayes follow this exact same rationale.
Dimensionality Reduction
- It is thus intimately related to lossy compression in information theory.
- Dimensionality reduction is often performed before supervised learning to keep computational complexity low.
R code
Supervised Learning Code
library(magrittr)  # for piping
library(dplyr)     # for handling data frames
# Some utility functions (the original bodies were garbled in extraction;
# the definitions below are plausible readings):
l2 <- function(x) x^2 %>% sum %>% sqrt  # Euclidean (l2) norm
l1 <- function(x) abs(x) %>% sum        # l1 norm
MSE <- function(x) x^2 %>% mean         # mean squared error
# The source breaks off at "missclassification"; a natural completion is
# the misclassification rate of a confusion table:
missclassification <- function(tab) 1 - sum(diag(tab)) / sum(tab)