NCCU Chih-Ming
Kaggle
https://www.facebook.com/groups/kaggletw/
https://www.kaggle.com/
Why Compete?
For Fun: Competing with others, as in running or racing
For Learning: Improving your abilities
What's Your Motivation?
Related Websites
http://dc.dsp.im/index.php
https://tianchi.aliyun.com/
An Overview of Making a Prediction
Raw Data
Data Cleaning
Train/Test Splitting
Exploratory Data Analysis (EDA)
Feature Engineering
Applying Estimation Models
Evaluation
Prediction
Preprocessing Tasks
- Data Cleaning
- Data Transformation
- Data Reduction
Recap
Missing Values
- drop the missing data
- replace them with a statistical value (mean / median / mode / clustering / modeling methods)
- label them as missing
Outlier Detection
- https://en.wikipedia.org/wiki/Outlier
Redundant Features
- we usually remove them
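The three ways of handling missing values can be sketched in a few lines. This is a minimal illustration on made-up ages, where `None` stands for a missing value and the median is just one possible choice of statistic.

```python
from statistics import median

ages = [19, None, 27, 32]  # made-up data; None marks a missing value

# Option 1: drop the missing data.
dropped = [a for a in ages if a is not None]

# Option 2: replace missing values with a statistic of the observed ones
# (the median here; mean or mode would work the same way).
fill = median(dropped)
replaced = [fill if a is None else a for a in ages]

# Option 3: label them, keeping a separate 0/1 "is missing" indicator.
is_missing = [1 if a is None else 0 for a in ages]
```

Options 2 and 3 are often combined, so that a model can see both the filled value and the fact that it was filled.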
Data Cleaning / Preprocessing
        Age
User A  19
User B  27
User C  200
Options for handling the outlier (User C):
- drop
- add label
          Age   Std.   Out.
  User A  19    1.2    1
  User B  27    1.1    1
  User C  200   6.3    0
- replace (mean / median / mode / clustering / modeling methods)
          Age
  User A  19
  User B  27
  User C  36
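The drop / add label / replace options for the Age example might look like this in code. The outlier rule (ages outside 0 to 120) and the flag convention (1 means outlier) are assumptions made for the sketch, and the median is only one of the possible replacement statistics, so the replaced value differs from the one on the slide.

```python
from statistics import median

ages = {"User A": 19, "User B": 27, "User C": 200}

def is_outlier(age, low=0, high=120):
    """Hypothetical rule: ages outside [low, high] count as outliers."""
    return not (low <= age <= high)

# drop: keep only the rows that pass the rule
dropped = {u: a for u, a in ages.items() if not is_outlier(a)}

# add label: keep every row and attach a 0/1 outlier flag
labeled = {u: (a, int(is_outlier(a))) for u, a in ages.items()}

# replace: substitute a statistic of the remaining values (median here)
fill = median(dropped.values())
replaced = {u: (fill if is_outlier(a) else a) for u, a in ages.items()}
```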
Train/Test Splitting
Random Splitting
Split by Time
Split by Id
Cross-Validation
Hold A Proper Validation
Random Splitting
Split by Time
Split by Id
[Figure: timeline divided into Train, Validation, and Test; two 7-day windows; dates 5/13, 5/20, 5/27]
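The three splitting strategies can be compared on a toy log. The rows below are made up, and the time layout (train before 5/13, validation 5/13 to 5/20, test 5/20 to 5/27) is an assumed reading of the timeline, not something the slides state explicitly.

```python
import random

# Made-up log rows: (user_id, day_of_may)
rows = [(user, day) for user in range(1, 7) for day in (10, 15, 22)]

# Random splitting: shuffle with a fixed seed, hold out 20% for validation.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
rand_train, rand_valid = shuffled[:cut], shuffled[cut:]

# Split by time (assumed layout): train before 5/13, validate on the
# 7 days from 5/13, test on the 7 days from 5/20.
time_train = [r for r in rows if r[1] < 13]
time_valid = [r for r in rows if 13 <= r[1] < 20]
time_test = [r for r in rows if 20 <= r[1] < 27]

# Split by id: every row of a held-out user goes to validation.
valid_users = {5, 6}
id_train = [r for r in rows if r[0] not in valid_users]
id_valid = [r for r in rows if r[0] in valid_users]
```

A validation split is most useful when it mirrors how the test set was built: if the test period lies in the future, splitting by time is usually the safer choice.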
RoadMap (1)
Feature Engineering
Feature Encoding
- Binary Features
- Numeric Features
- Categorical Features
Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
https://en.wikipedia.org/wiki/Feature_engineering
like/dislike?
- answer: like
- gender: boy, age: 22
- artist: Cheer, genre: pop
Feature Encoding
Convert the extracted features into a form that the applied machine learning models can read.
1 = boy*w1 + age*w2 + cheer*w3 + pop*w4
0 = girl*w1 + age*w2 + may_day*w3 + indie*w4
How should boy, age, cheer, and pop be represented as numbers?
Feature indices: boy -> 0, age -> 2, cheer -> 13, pop -> 26
0 AND 2 AND 13 AND 26 = 1
- Binary Features
- Numeric Features
- Categorical Features
Binarization
Gender: 0 for boy, 1 for girl
Age: 0 for < 18
        Gender  Age    0  1
User A  boy     17     0  0
User B  boy     27     0  1
User C  girl    32     1  1
(the column headers 0 and 1 are feature indices)
What about using BINNING?
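A minimal sketch of the binarization above. It assumes the age rule means "0 for anyone under 18", which is consistent with the encoded values for the three users; the function name and the `adult_age` parameter are made up for the example.

```python
users = {
    "User A": {"gender": "boy", "age": 17},
    "User B": {"gender": "boy", "age": 27},
    "User C": {"gender": "girl", "age": 32},
}

def binarize(user, adult_age=18):
    """Return [feature 0, feature 1] as 0/1 values."""
    return [
        0 if user["gender"] == "boy" else 1,   # feature 0: 0 for boy, 1 for girl
        0 if user["age"] < adult_age else 1,   # feature 1: 0 below the threshold
    ]

encoded = {name: binarize(u) for name, u in users.items()}
```

Binning generalizes the age rule: instead of a single threshold, several cut points would map age to one of a few buckets.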
Categorical Features
        Artist
User A  Mayday
User B  SEKAI_NO_OWARE
User C  The_Beatles
One-hot Encoding (columns 2-4) and Grouping by language (columns 5-7)
        2       3               4            5    6    7
        mayday  sekai_no_oware  the_beatles  CHN  JPN  ENG
User A  1       0               0            1    0    0
User B  0       1               0            0    1    0
User C  0       0               1            0    0    1
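One-hot encoding and grouping by language can be sketched as follows. The helper `one_hot` and the `language_of` lookup are illustrative names; the artist-to-language mapping follows the columns shown above.

```python
artists = {
    "User A": "Mayday",
    "User B": "SEKAI_NO_OWARE",
    "User C": "The_Beatles",
}

# Lookup from artist to language group, as laid out on the slide.
language_of = {"Mayday": "CHN", "SEKAI_NO_OWARE": "JPN", "The_Beatles": "ENG"}

def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == v else 0 for v in vocabulary]

artist_vocab = ["Mayday", "SEKAI_NO_OWARE", "The_Beatles"]
language_vocab = ["CHN", "JPN", "ENG"]

# Concatenate the artist one-hot and the language one-hot for each user.
encoded = {
    user: one_hot(artist, artist_vocab) + one_hot(language_of[artist], language_vocab)
    for user, artist in artists.items()
}
```

With only three artists the two groups of columns look identical, but with thousands of artists the language columns stay small and let the model share information across artists.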
Numerical Features
[Figure: U connected to T1, T2, and T3 with counts 23, 1, and 6]
             T1     T2    T3
count        23     1     6
binary       1      0     1
probability  23/30  1/30  6/30
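The three representations of the same counts can be derived like this. The binary threshold (count greater than 1) is an assumption, chosen only because it reproduces the 1 0 1 row; the slides do not say which rule was used.

```python
counts = {"T1": 23, "T2": 1, "T3": 6}

# binary: 1 when the count exceeds a threshold (assumed to be 1 here,
# which reproduces the values 1, 0, 1)
threshold = 1
binary = {t: int(c > threshold) for t, c in counts.items()}

# probability: each count divided by the total (23 + 1 + 6 = 30)
total = sum(counts.values())
probability = {t: c / total for t, c in counts.items()}
```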
Numerical Features
Standardization / Normalization
Rescaling
(required by many ML algorithms; see https://en.wikipedia.org/wiki/Feature_scaling)
Transform the Distribution
- logarithmic transformation
- tf-idf like transformation
- https://en.wikipedia.org/wiki/Data_transformation_(statistics)
Binning / Sampling
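A short sketch of three of these transformations using only the standard library. The input values are arbitrary, and `log1p` (log of 1 + v) is one common choice for the logarithmic transformation because it is defined at zero.

```python
import math
from statistics import mean, pstdev

values = [23, 1, 6]  # arbitrary example values

# Standardization: zero mean and unit (population) standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Rescaling (min-max): map the values into the range [0, 1].
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# Logarithmic transformation: log(1 + v) compresses large values.
logged = [math.log1p(v) for v in values]
```

When a validation or test set is involved, `mu`, `sigma`, `lo`, and `hi` should be computed on the training data and reused, so that no information leaks from the held-out rows.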
Categorical vs. Numerical
Ordinal Categories
HATE  DON'T MIND  LIKE  LOVE
0     1           2     3
[Figure: exp(value) plotted for HATE, DON'T MIND, LIKE, LOVE; y-axis from 0 to 8]
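Ordinal categories can be mapped to their rank and then stretched with exp(value), as in this small sketch. The sample answers are made up, and the code does not try to reproduce the scale of the figure.

```python
import math

ORDER = ["HATE", "DON'T MIND", "LIKE", "LOVE"]
rank = {label: i for i, label in enumerate(ORDER)}  # HATE -> 0 ... LOVE -> 3

answers = ["LIKE", "HATE", "LOVE"]  # made-up responses
ordinal = [rank[a] for a in answers]

# exp(value) widens the gaps between the higher categories
stretched = [math.exp(v) for v in ordinal]
```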
RoadMap (2)
Advanced Feature Engineering
Feature Extraction
- Feature Interactions
- Data Mining
- Dimensionality Reduction
- Domain-specific Process
Example (1)
Text-based
- Vector Space Model
- Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
[Figure: word embedding space showing MAN, WOMAN, KING, QUEEN]
need stemming? lemmatization?
Example (2)
Text: ......
segmentation: [] [] [] [] [] [] [] [] [] [] []
filtering
dummy variables: :1 :1 :1 :2 :1 :1 :2 :1
Advanced Weighting?: :2 :1 :1 :4 :2 :1 :1 :0.8
Word Embeddings?
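The segmentation, filtering, and counting steps might be sketched as below. The sentence and the stop-word list are made up, and whitespace splitting stands in for a real word segmenter, which languages without spaces between words would need.

```python
from collections import Counter

text = "the song is a pop song and the singer is a pop star"  # made-up text
STOPWORDS = {"the", "is", "a", "and"}  # hypothetical filter list

tokens = text.split()                              # segmentation (whitespace, assumed)
kept = [t for t in tokens if t not in STOPWORDS]   # filtering
counts = Counter(kept)                             # per-term counts

# Fix a column order so every document maps to a vector of the same shape.
vocabulary = sorted(counts)
vector = [counts[w] for w in vocabulary]
```

Advanced weighting schemes such as tf-idf would replace the raw counts in `vector` with reweighted values; the pipeline before that step stays the same.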
Example (3)
Image-based
- SIFT
- Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
Example (4)
1 = boy*w1 + age*w2 + cheer*w3 + pop*w4
+ (boy AND pop)*w5
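The interaction term (boy AND pop) can be added as a fifth feature, as in this sketch. It assumes boy, cheer, and pop are already 0/1 features; the weights are made-up numbers, not learned values.

```python
def with_interaction(features):
    """Append (boy AND pop) to [boy, age, cheer, pop]."""
    boy, age, cheer, pop = features
    # For 0/1 features, the product equals the logical AND.
    return [boy, age, cheer, pop, boy * pop]

def score(features, weights):
    """Weighted sum, as in boy*w1 + age*w2 + ... + (boy AND pop)*w5."""
    return sum(x * w for x, w in zip(features, weights))

x = with_interaction([1, 22, 1, 1])
weights = [0.1, 0.0, 0.3, 0.2, 0.4]  # made-up values for w1..w5
prediction = score(x, weights)
```

A linear model cannot express "boys who listen to pop" from boy and pop alone; the extra column gives it a weight dedicated to that combination.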
Realize the Meaning Behind the Observed Features
2017/05/20 08:00 -> Holiday? Weekday? Day? Night?
Taipei -> Asia, Mandarin
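Deriving such features from a raw timestamp and city could look like this. The day/night boundary (06:00 to 18:00) and the two lookup tables are assumptions made for the sketch; detecting public holidays would need a calendar source, which is not covered here.

```python
from datetime import datetime

def time_features(timestamp, fmt="%Y/%m/%d %H:%M"):
    dt = datetime.strptime(timestamp, fmt)
    return {
        "weekday": dt.weekday(),                # Monday=0 ... Sunday=6
        "is_weekend": int(dt.weekday() >= 5),   # Saturday or Sunday
        "is_daytime": int(6 <= dt.hour < 18),   # assumed day/night boundary
    }

# Hypothetical lookup tables; a real project would need fuller coverage.
CITY_REGION = {"Taipei": "Asia"}
CITY_LANGUAGE = {"Taipei": "Mandarin"}

features = time_features("2017/05/20 08:00")
features["region"] = CITY_REGION["Taipei"]
features["language"] = CITY_LANGUAGE["Taipei"]
```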
RoadMap (3)
Popular ML Models
Linear Models
Tree-based Models
Support Vector Machines
K-means
Understand the Pros and Cons
Linear Model
- simple, fast and easy to tune
- low memory usage
- non-complex (linear decision boundary)
[Figure: Linear vs. Non-Linear]
Understand the Pros and Cons
Random Forest
- works very well in many competitions
- fast and easy to tune
- memory hungry
[Figure: decision tree grown step by step. Root split: Language (CHN / NON-CHN), with leaves Age=16 and Age=26. Second-level splits: Genre=Pop? under CHN and Western? under NON-CHN (YES / NO). Final leaves: Age=12, Age=19, Age=29, Age=23]
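The tree in the figure can be read as nested if/else rules. This sketch hand-codes a single tree; it assumes the leaves are listed left to right as YES then NO under each question, which the slides do not state outright. A random forest would train many such trees on resampled data and average their outputs.

```python
def predict_age(language, genre_is_pop=False, is_western=False):
    """Walk the tree: Language first, then one yes/no question per branch."""
    if language == "CHN":
        # CHN branch asks: Genre=Pop?
        return 12 if genre_is_pop else 19
    # NON-CHN branch asks: Western?
    return 29 if is_western else 23

def forest_predict(trees, *args, **kwargs):
    """Average the predictions of several trees (the forest idea)."""
    return sum(tree(*args, **kwargs) for tree in trees) / len(trees)
```

Calling `forest_predict([predict_age], "CHN", genre_is_pop=True)` with a one-tree forest simply returns that tree's prediction.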
Understand the Pros and Cons
Neural Networks
- easy end-to-end learning
- flexible
- hard to tune/train
https://en.wikipedia.org/wiki/Artificial_neural_network
[Figure: network with inputs A BOY?, CHN?, POP? and outputs LIKE, DISLIKE]
Understand the Pros and Cons
SVM
- strong theoretical guarantees
- good at preventing overfitting
- slow and memory heavy
- usually needs a grid search over hyperparameters
https://en.wikipedia.org/wiki/Support_vector_machine
Understand the Pros and Cons
Gradient Boosting Machine (GBM)
- usually hard to beat on dense feature sets
Factorization Machine (FM)
- excels at handling sparse data
Understand the Pros and Cons
There are too many details.
Find some online courses or ML books:
- The Elements of Statistical Learning
- Machine Learning: A Probabilistic Perspective
- Programming Collective Intelligence
- Pattern Recognition and Machine Learning (Information Science and Statistics)
Understand the Pros and Cons
I'll tell you everything.
HOMEWORK PROJECT
Find a Dataset or Join a Competition
Apply the Techniques Presented in this Course
Raw Data
Data Cleaning
Train/Test Splitting
Exploratory Data Analysis (EDA)
Feature Engineering
Applying Estimation Models
Evaluation
Prediction
What to HAND IN
A Paper Report
Any toolkit is welcome.
Select and use one or more topics you learned from the course.
Show the performance difference between different methods.
ANY QUESTIONS?
changecandy at gmail