Feature Engineering and Classifier Ensemble for KDD Cup 2010

Chih-Jen Lin
Department of Computer Science, National Taiwan University

Joint work with HF Yu, HY Lo, HP Hsieh, JK Lou, T McKenzie, JW Chou, PH Chung, CH Ho, CF Chang, YH Wei, JY Weng, ES Yan, CW Chang, TT Kuo, YC Lo, PT Chang, C Po, CY Wang, YH Huang, CW Hung, YX Ruan, YS Lin, SD Lin and HT Lin

July 25, 2010
Outline

- Team Members
- Initial Approaches and Some Settings
- Sparse Features and Linear Classification
- Condensed Features and Random Forest
- Ensemble and Final Results
- Discussion and Conclusions
Team Members
At National Taiwan University, we organized a course for KDD Cup 2010
Three instructors, two TAs, 19 students and one RA
Tiger (RA)
Yu-Shi Lin (林育仕)
Snoopy (TAs)
Hsiang-Fu Yu (余相甫) and Hung-Yi Lo (駱宏毅)

Snoopy and Pikachu are the IDs of our team in the final stage of the competition
Instructors
林智仁 (Chih-Jen Lin), 林軒田 (Hsuan-Tien Lin) and 林守德 (Shou-De Lin)
Initial Approaches and Some Settings
Initial Thoughts and Our Approach
We suspected that this competition would be very different from past KDD Cups

Domain knowledge seems to be extremely important for educational systems
Temporal information may be crucial
At first, we explored a temporal approach
We tried Bayesian networks
But we quickly found that using a traditional classification approach is easier
Initial Thoughts and Our Approach (Cont'd)
Traditional classification:
Data points: independent Euclidean vectors
Suitable features to reflect domain knowledge andtemporal information
Domain knowledge and temporal information: important, but not as extremely important as we thought in the beginning
Our Framework
Problem → Sparse Features / Condensed Features → Ensemble
Validation Sets
- Avoid overfitting the leader board
- Standard validation ⇒ ignores the time series
- Our validation set: last problem of each unit in the training set
- This simulates the procedure used to construct the testing sets

A unit of problems:
  problem 1 ∈ V̄
  problem 2 ∈ V̄
  ...
  last problem ∈ V
(V̄: internal training; V: internal validation)

- In the early stage, we focused on validation sets
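The validation split above can be sketched with pandas; the column names (`student`, `unit`, `problem`) are illustrative, not the dataset's exact headers:

```python
# Sketch: within each (student, unit), the steps of the last problem go to
# the internal validation set V, everything before it to internal training.
import pandas as pd

def split_last_problem(df):
    """Split off the last problem of each (student, unit) as validation."""
    # Rows are assumed chronologically ordered; the last problem seen per
    # (student, unit) simulates how the official test set was constructed.
    last = df.groupby(["student", "unit"])["problem"].transform("last")
    is_val = df["problem"] == last
    return df[~is_val], df[is_val]

rows = pd.DataFrame({
    "student": ["s1"] * 4,
    "unit":    ["u1"] * 4,
    "problem": ["p1", "p1", "p2", "p2"],
    "correct": [1, 0, 1, 1],
})
train, val = split_last_problem(rows)   # p1 steps -> train, p2 steps -> val
```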
Sparse Features and Linear Classification
Basic Sparse Features
- Categorical features, expanded to binary features: student, unit, section, problem, step, KC
- Numerical features, scaled by log(1 + x): opportunity value, problem view
- A89: algebra 2008-2009; B89: bridge to algebra 2008-2009
- We use LIBLINEAR, developed at National Taiwan University (Fan et al., 2008)
- We consider logistic regression instead of SVM
- Training time: about 1 hour for 20M instances and 30M features (B89)
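A minimal sketch of this feature pipeline, using scikit-learn's LIBLINEAR-backed logistic regression on toy data (not the team's exact code; feature values are illustrative):

```python
# Categorical columns expanded to binary indicators, numerical ones scaled
# by log(1 + x), then L2-regularized logistic regression via LIBLINEAR.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

cats = np.array([["s1", "p1"], ["s1", "p2"], ["s2", "p1"], ["s2", "p2"]])
opportunity = np.array([[0.0], [3.0], [1.0], [7.0]])  # a numerical feature
y = np.array([1, 0, 1, 1])                            # correct first attempt

X = hstack([
    OneHotEncoder().fit_transform(cats),   # categorical -> sparse binary
    csr_matrix(np.log1p(opportunity)),     # numerical  -> log(1 + x)
])
clf = LogisticRegression(solver="liblinear").fit(X, y)
probs = clf.predict_proba(X)[:, 1]         # predicted CFA probabilities
```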
Result Using Sparse Features
Leader board results:
                       A89     B89
Basic sparse features  0.2895  0.2985
Best sparse features   0.2784  0.2830
Best leader board      0.2759  0.2777
Condensed Features and Random Forest
Condensed Features

- Use the correct first attempt rate (CFAR). Example, for a student named sid:

  CFAR = (# steps with student = sid and CFA = 1) / (# steps with student = sid)

- CFARs for student, step, KC, problem, (student, unit), (problem, step), (student, KC) and (student, problem)
- Temporal features, from the previous ≤ 6 steps with the same student and KC:
  - An indicator for the existence of such steps
  - Correct first attempt rate
  - Average hint request rate
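The CFAR computation above amounts to a grouped mean of the 0/1 CFA column; a pandas sketch with illustrative column names:

```python
# CFAR per key = (# steps with that key and CFA = 1) / (# steps with that key),
# i.e. the mean of the 0/1 CFA column within each group.
import pandas as pd

steps = pd.DataFrame({
    "student": ["sid", "sid", "sid", "other"],
    "cfa":     [1, 0, 1, 1],   # correct first attempt (0/1)
})
# Per-student CFAR; the same groupby works for step, KC, problem, or pairs
# such as (student, unit) by changing the key columns.
cfar = steps.groupby("student")["cfa"].mean()
```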
Condensed Features (Cont’d)
Temporal features:

- When was a step with the same student name and KC last seen?
- Binary features to model four levels: same day, 1-6 days, 7-30 days, > 30 days
- Opportunity and problem view: scaled

Total: 17 condensed features
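The four-level recency encoding can be sketched as follows (bin edges follow the slide; the function name is ours):

```python
# Map the gap in days since a step with the same student and KC was last
# seen to four mutually exclusive binary indicators.
def recency_levels(days_since_last):
    return [
        int(days_since_last == 0),        # same day
        int(1 <= days_since_last <= 6),   # 1-6 days
        int(7 <= days_since_last <= 30),  # 7-30 days
        int(days_since_last > 30),        # > 30 days
    ]
```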
Training by Random Forest
Due to a small # of features, we could try several classifiers via Weka (Hall et al., 2009)

Random Forest (Breiman, 2001) showed the best performance:

                         A89     B89
Basic sparse features    0.2895  0.2985
Best sparse features     0.2784  0.2830
Best condensed features  0.2824  0.2847
Best leader board        0.2759  0.2777
This small feature set works well
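The slides used Weka's Random Forest; an equivalent sketch with scikit-learn (random data stands in for the real 17-feature matrix):

```python
# Random Forest on a small dense feature matrix, as used for the
# condensed-feature model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 17))                  # 17 condensed features per step
y = (rng.random(200) < 0.7).astype(int)    # synthetic CFA labels

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
probs = forest.predict_proba(X)[:, 1]      # predicted CFA probabilities
```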
Ensemble and Final Results
Linear Regression for Ensemble
Linear regression to ensemble sub-team results:

  min_w ‖y − Pw‖² + (λ/2)‖w‖²

- y: labels of the testing set, l × 1 (l: # testing data)
- P: l × (# results from students)
- Predictions truncated to [0, 1]: min(1, max(0, Pw))
- Need some techniques as y is unavailable
- Decision of the regularization parameter λ
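The ridge objective above has a closed-form solution; a sketch with NumPy on synthetic data (in the competition y was unavailable for the test set, so this is only the mechanics):

```python
# Minimize ||y - Pw||^2 + (lam/2)||w||^2; setting the gradient to zero
# gives (P^T P + (lam/2) I) w = P^T y. Then truncate Pw to [0, 1].
import numpy as np

def ridge_ensemble(P, y, lam):
    k = P.shape[1]
    w = np.linalg.solve(P.T @ P + (lam / 2) * np.eye(k), P.T @ y)
    return np.clip(P @ w, 0.0, 1.0)   # truncation min(1, max(0, Pw))

P = np.array([[0.2, 0.3],             # two sub-team prediction columns
              [0.9, 0.8],
              [0.6, 0.4]])
y = np.array([0.0, 1.0, 0.5])
blended = ridge_ensemble(P, y, lam=1.0)
```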
Ensemble Results
Ensemble significantly improves the results
                         A89     B89     Avg.
Basic sparse features    0.2895  0.2985  0.2940
Best sparse features     0.2784  0.2830  0.2807
Best condensed features  0.2824  0.2847  0.2835
Best ensemble            0.2756  0.2780  0.2768
Best leader board        0.2759  0.2777  0.2768
Our team ranked 2nd on the leader board
The difference to the 1st is small; we hoped that our solution did not overfit the leader board too much and might be better on the complete challenge set
Final Results
Rank  Team name                   Leader board  Cup
1     National Taiwan University  0.276803      0.272952
2     Zhang and Su                0.276790      0.273692
3     BigChaos @ KDD              0.279046      0.274556
4     Zach A. Pardos              0.279695      0.276590
5     Old Dogs With New Tricks    0.281163      0.277864
Team names used during the competition:
Snoopy ⇒ National Taiwan University
BbCc ⇒ Zhang and Su
Cup scores generally better than leader board
Discussion and Conclusions
Diversities in Learning
We believe that one key to our ensemble's success is the diversity
Feature diversity
Classifier diversity
Different sub-teams try different ideas guided by their human intelligence
Our student sub-teams even have biodiversity
Mammals: snoopy, tiger
Birds: weka, duck
Insects: armyants, trilobite
Marine animals: starfish, sunfish
Conclusions
Feature engineering and classifier ensemble seem to be useful for educational data mining

All our team members worked very hard, but we are also a bit lucky

We thank the organizers for organizing this interesting and fruitful competition

We also thank National Taiwan University for providing a stimulating research environment