Overview of Machine Learning & Feature Engineering
Machine Learning 101 Tutorial
Strata + Hadoop World, NYC, Sep 2015
Alice Zheng, Dato
About us
Chris DuBois: Intro to recommenders
Alice Zheng: Overview of ML
Piotr Teterwak: Intro to image search & deep learning
Krishna Sridhar: Deploying ML as a predictive service
Danny Bickson: TA
Alon Palombo: TA
Why machine learning?
Model data. Make predictions. Build intelligent applications.
Classification
Predict among a discrete set of classes
Input Output
Spam filtering
Data → prediction: spam vs. not spam
Text classification
Classes: EDUCATION, FINANCE, TECHNOLOGY
Regression
Predict real/numeric values
Stock market
Input
Output
Similarity
Find things like this
Similar products
Input: product I’m buying
Output: other products I might be interested in
Given image, find similar images
http://www.tiltomo.com/
Recommender systems
Learn what I want before I know it
Playlist recommendations
Recommendations form a coherent & diverse sequence
Friend recommendations
Users and “items” are of the same type
Clustering
Grouping similar items
Clustering images
Goldberger et al.
Set of Images
Clustering web search results
Machine learning … how?
Data → Answers
Data example: I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
Many systems
Many tools
Many teams
Lots of methods/jargon
The machine learning pipeline
Raw data → Features → Models → Predictions → Deploy in production
Raw data example: I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
Three things to know about ML
• Feature = numeric representation of raw data
• Model = mathematical “summary” of features
• Making something that works = choose the right model and features, given data and task
Feature = numeric representation of raw data
Representing natural text
Raw text: It is a puppy and it is extremely cute.
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?
Bag of words: {“it”:2, “is”:2, “a”:1, “puppy”:1, “and”:1, “extremely”:1, “cute”:1 }
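The word counts above can be reproduced with a few lines of Python. This is a minimal sketch, not the tutorial's own code: the function name `bag_of_words` and the tokenization rule (lowercase, letters and apostrophes only) are assumptions chosen for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word.

    The tokenizer is an assumption: it keeps runs of letters and
    apostrophes and drops punctuation and digits.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bow = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(bow))
```

Note that word order is discarded: any reordering of the same words gives the same bag.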
Representing natural text
Raw text: It is a puppy and it is extremely cute.
Classify: puppy or not?
Bag of words, as a sparse vector representation (one entry per vocabulary word):
word        count
it          2
they        0
I           0
am          0
how         0
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
…           …
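A sketch of how a bag of words becomes a vector over a fixed vocabulary. The vocabulary list below is hypothetical (a real one would cover every word in the corpus), and `to_dense` and `to_sparse` are illustrative names, not part of any library. The sparse form stores only the nonzero entries, which is what makes very large vocabularies practical.

```python
from collections import Counter

# Hypothetical vocabulary for illustration; words outside it are ignored.
VOCABULARY = ["it", "they", "i", "am", "how", "puppy",
              "and", "cat", "aardvark", "cute", "extremely"]

def to_dense(counts, vocabulary):
    """One entry per vocabulary word, zero when the word is absent."""
    return [counts.get(word, 0) for word in vocabulary]

def to_sparse(counts, vocabulary):
    """Map vocabulary index -> count, keeping only nonzero entries."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return {index[w]: c for w, c in counts.items() if w in index and c != 0}

counts = Counter("it is a puppy and it is extremely cute".split())
dense = to_dense(counts, VOCABULARY)
sparse = to_sparse(counts, VOCABULARY)
print(dense)
print(sparse)
```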
Representing images
Raw image: millions of RGB triplets, one for each pixel
Classify: person or animal?
Raw image → bag of visual words
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.
Representing images
Classify: person or animal?
Raw image → deep learning features
Dense vector representation: [3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
Feature space in machine learning
• Raw data → high dimensional vectors
• Collection of data points → point cloud in feature space
• Feature engineering = creating features of the appropriate granularity for the task
Crudely speaking, mathematicians fall into two categories: the algebraists, who find it easiest to reduce all problems to sets of numbers and variables, and the geometers, who understand the world through shapes.
-- Masha Gessen, “Perfect Rigor”
Algebra vs. Geometry
Pythagorean Theorem (Euclidean space)
Algebra: a^2 + b^2 = c^2
Geometry: [Figure: right triangle with sides a, b, c]
Visualizing a sphere in 2D
x^2 + y^2 = 1
Pythagorean theorem: a^2 + b^2 = c^2
[Figure: unit circle on the x and y axes, with a right triangle of sides a, b, c]
Visualizing a sphere in 3D
x^2 + y^2 + z^2 = 1
[Figure: unit sphere on the x, y, and z axes]
Visualizing a sphere in 4D
x^2 + y^2 + z^2 + t^2 = 1
[Figure: drawn on the x, y, and z axes]
Why are we looking at spheres?
Poincaré Conjecture: all physical objects without holes are “equivalent” to a sphere.
The power of higher dimensions
• A sphere in 4D can model the birth and death process of physical objects
• High dimensional features can model many things
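One way to read the first bullet, assuming the fourth coordinate t is treated as time (an interpretation, not something the slide states): fixing t in x^2 + y^2 + z^2 + t^2 = 1 leaves an ordinary 3D sphere of radius sqrt(1 - t^2). As t runs from -1 to 1, that sphere appears as a point, grows, shrinks, and vanishes. The helper name `slice_radius` is made up for this sketch.

```python
import math

def slice_radius(t):
    """Radius of the 3D slice of the unit 4D sphere at a fixed t.

    Returns None when |t| > 1, where the slice is empty.
    """
    if abs(t) > 1:
        return None
    return math.sqrt(1 - t * t)

# Sweep t from "birth" (-1) to "death" (+1) of the 3D object.
for t in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(t, slice_radius(t))
```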
Visualizing Feature Space
The challenge of high dimension geometry
• Feature space can have hundreds to millions of dimensions
• In high dimensions, our geometric imagination is limited
  - Algebra comes to our aid
Visualizing bag-of-words
I have a puppy and it is extremely cute
word        count
it          1
they        0
I           1
am          0
how         0
puppy       1
and         1
cat         0
aardvark    0
zebra       0
cute        1
extremely   1
…           …
[Figure: the document plotted at (1, 1) on the “puppy” and “cute” axes]
Visualizing bag-of-words
[Figure: three documents plotted on the “puppy”, “cute”, and “extremely” axes]
I have a puppy and it is extremely cute
I have an extremely cute cat
I have a cute puppy
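To make the picture concrete, the sketch below computes where each of the three documents lands on the “puppy”, “cute”, and “extremely” axes. The helper `to_point` is a hypothetical name, and splitting on whitespace is a simplifying assumption that works for these punctuation-free sentences.

```python
from collections import Counter

AXES = ["puppy", "cute", "extremely"]  # the three word axes in the plot

def to_point(document, axes):
    """Coordinates of a document: its count of each axis word."""
    counts = Counter(document.lower().split())
    return tuple(counts[word] for word in axes)

documents = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]
points = {doc: to_point(doc, AXES) for doc in documents}
for doc, point in points.items():
    print(point, doc)
```

All other words are projected away, which is why only a few axes can be drawn at a time.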
Document point cloud
[Figure: documents as a point cloud on the “word 1” and “word 2” axes]
Model = mathematical “summary” of features
What is a summary?
• Data → point cloud in feature space
• Model = a geometric shape that best “fits” the point cloud
Clustering model
[Figure: point cloud on the “Feature 1” and “Feature 2” axes]
Group data points tightly
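As an illustration of a clustering model, here is a bare-bones k-means sketch. The slide does not name an algorithm, so k-means is an assumption, as is the initialization (the first k points serve as starting centers, which real implementations replace with something more robust). It needs Python 3.8+ for `math.dist`.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means. Simplification: the first k points are the initial centers."""
    centers = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```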
Classification model
[Figure: point cloud on the “Feature 1” and “Feature 2” axes]
Decide between two classes
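A small sketch of a model that decides between two classes. It uses a nearest-centroid classifier, chosen here only because it is short and its decision surface is a straight line halfway between the class means; the slide does not say which classifier it depicts. The function names and sample data are made up.

```python
import math

def fit_centroids(points, labels):
    """Return the mean point of each class."""
    sums = {}
    for point, label in zip(points, labels):
        total, count = sums.get(label, ((0.0,) * len(point), 0))
        sums[label] = (tuple(t + x for t, x in zip(total, point)), count + 1)
    return {label: tuple(t / count for t in total)
            for label, (total, count) in sums.items()}

def predict(centroids, point):
    """Assign the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]
centroids = fit_centroids(points, labels)
print(predict(centroids, (2.0, 2.0)))
print(predict(centroids, (6.0, 5.0)))
```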
Regression model
[Figure: data points on the “Feature” (horizontal) and “Target” (vertical) axes]
Fit the target values
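A sketch of the simplest regression model, a straight line fitted by ordinary least squares to one feature. Least squares is an assumption about what “fit” means here, and the data is invented so that it lies exactly on the line target = 2 x feature + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance  # fails if all xs are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0]   # feature values
ys = [1.0, 3.0, 5.0, 7.0]   # target values
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```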
Visualizing Feature Engineering
When does bag-of-words fail?
Documents:
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
[Figure: the four documents plotted on the “puppy”, “cat”, and “have” axes]
Task: find a surface that separates documents about dogs vs. cats
Problem: the word “have” adds fluff instead of information
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words are discounted
• Term frequency (tf) = number of times a term appears in a document
• Inverse document frequency of a word (idf) = log(N / number of documents containing the word)
• N = total number of documents
• Tf-idf count = tf x idf
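A minimal sketch of these definitions, assuming idf(word) = log(N / number of documents containing the word), which is the common form and matches the worked values in this deck (log 4 for “puppy”, 0 for “have”). The four documents are the deck's own example; the function names and whitespace tokenizer are assumptions.

```python
import math
from collections import Counter

def tokenize(document):
    """Whitespace tokenizer; enough for these punctuation-free sentences."""
    return document.lower().split()

def idf(word, documents):
    """log(N / number of documents containing the word).

    Raises ZeroDivisionError for a word that appears in no document.
    """
    n_containing = sum(1 for doc in documents if word in tokenize(doc))
    return math.log(len(documents) / n_containing)

def tfidf(word, document, documents):
    """Term frequency in this document times idf over the collection."""
    tf = Counter(tokenize(document))[word]
    return tf * idf(word, documents)

documents = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
print(idf("puppy", documents))                # log 4
print(idf("have", documents))                 # log 1 = 0
print(tfidf("have", documents[3], documents)) # 2 x 0 = 0
```

Because “have” occurs in every document, its idf is zero and it contributes nothing, however often it is repeated.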
From BOW to tf-idf
Documents (N = 4):
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
[Figure: the four documents plotted on the “puppy”, “cat”, and “have” axes]
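From BOW to tf-idf
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
[Figure: in tf-idf space, “I have a puppy” sits at log 4 on the “puppy” axis and “I have a cat” at log 4 on the “cat” axis; “I have a dog and I have a pen” and “I have a kitten” fall on the same point; a decision surface separates the puppy and cat documents]
Tf-idf flattens uninformative dimensions in the BOW point cloud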
Entry points of feature engineering
• Start from data and task
  - What’s the best text representation for classification?
• Start from modeling method
  - What kind of features does k-means assume?
  - What does linear regression assume about the data?
Dato’s Machine Learning Platform
Dato’s machine learning platform
Raw data → Features → Models → Predictions → Deploy in production
GraphLab Create
Dato Distributed
Dato Predictive Services
Data structures for feature engineering
Features: SFrames, SGraphs
[Figure labels: User, Com., Title, Body, User, Disc.]
Machine learning toolkits in GraphLab Create
• Classification/regression
• Clustering
• Recommenders
• Deep learning
• Similarity search
• Data matching
• Sentiment analysis
• Churn prediction
• Frequent pattern mining
• And on…
Demo
Dimensionality reduction
[Figure: point cloud on the “Feature 1” and “Feature 2” axes]
Flatten non-useful features
PCA: Find most non-flat linear subspace
PCA : Principal Component Analysis
Center data at origin
PCA : Principal Component Analysis
Find a line such that the average squared distance from the data points to the line is minimized.
This is the 1st Principal Component
PCA : Principal Component Analysis
Find a 2nd line
- at right angles to the 1st
- such that the average squared distance from the data points to the line is minimized.
This is the 2nd Principal Component
PCA : Principal Component Analysis
Find a 3rd line
- at right angles to the previous lines
- such that the average squared distance from the data points to the line is minimized.
… There can only be as many principal components as the dimensionality of the data.
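A sketch of finding the 1st principal component in plain Python. It centers the data, builds the covariance matrix, and runs power iteration. Power iteration is a technique swapped in for brevity, not something the slides describe. It relies on the fact that the direction of greatest variance is also the line through the center with the smallest average squared distance to the points. The starting vector is arbitrary, and the sign of the result is not meaningful.

```python
import math

def first_principal_component(points, iterations=100):
    """Unit vector along the 1st principal component, via power iteration."""
    n = len(points)
    dim = len(points[0])
    # Center the data at the origin.
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    # Covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(dim)]
           for i in range(dim)]
    # Arbitrary starting direction (an assumption; it must not be
    # exactly perpendicular to the answer).
    direction = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # degenerate case: no variance along this direction
        direction = [x / norm for x in product]
    return direction

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
component = first_principal_component(points)
print(component)
```

For these points, which lie on the line y = x, the component points along the diagonal.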
Demo
Coursera Machine Learning Specialization
• Learn machine learning in depth
• Build and deploy intelligent applications
• Year long certification program
• Joint project between University of Washington + Dato
• Details: https://www.coursera.org/specializations/machine-learning
Next up today
[email protected] @RainyData, #StrataConf
11:30am - Intro to recommenders (Chris DuBois)
1:30pm - Intro to image search & deep learning (Piotr Teterwak)
3:30pm - Deploying ML as a predictive service (Krishna Sridhar)