Introduction to Hivemall and its new features in v0.4
Research Engineer Makoto YUI @myui
2015/10/20 Hivemall meetup #2
Tweet w/ #hivemallmtup
http://eventdots.jp/event/571107

Who am I?
• 2015.04 Joined Treasure Data, Inc. 1st Research Engineer in Treasure Data. My mission in TD is developing ML-as-a-Service
• 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale Machine Learning project and Parallel Databases
• 2009.03 Ph.D. in Computer Science from NAIST
• Super programmer award from the MITOU Foundation

Agenda
1. What is Hivemall
2. How to use Hivemall
3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine
4. Development Roadmap of Hivemall

What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall

What is Hivemall
[Diagram: Hivemall in the Hadoop stack, top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

Hivemall’s Vision: ML on SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop

List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3

Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Churn Detection
  • Algorithm: Regression
  • OISIX and more
• Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech Companies, ISP portal, and more
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense

How to use Hivemall
[Diagram: machine learning workflow. Training: feature vectors and labels produce a prediction model. Prediction: the prediction model maps feature vectors to labels. Current step: Data preparation]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall
[Diagram: machine learning workflow. Training: feature vectors and labels produce a prediction model. Prediction: the prediction model maps feature vectors to labels. Current step: Feature Engineering]

How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0

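The rescaling step above is ordinary min-max normalization. A minimal Python sketch of the arithmetic follows, assuming that `rescale(value, min, max)` computes `(value - min) / (max - min)`. The function is an illustrative stand-in rather than Hivemall's implementation, and the sample label values are made up.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch only: a constant column maps to 0.5.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values, for illustration only.
labels = [-7.9, -3.5, -0.5]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

In the Hive query, `${min_label}` and `${max_label}` play the role of `min_label` and `max_label` here, so they have to be computed over the training labels beforehand.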
How to use Hivemall
[Diagram: machine learning workflow. Training: feature vectors and labels produce a prediction model. Prediction: the prediction model maps feature vectors to labels. Current step: Training]

How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- Reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- Shuffle map-outputs to reducers by feature

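The outer `avg(weight) ... GROUP BY feature` in the query above is plain model averaging: each mapper emits its own (feature, weight) pairs, and the weights of the same feature are averaged. A minimal Python sketch of that step, with made-up mapper outputs; `average_models` is a hypothetical helper, not a Hivemall function:

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average the weight of each feature across independently trained models.

    mapper_outputs: iterable of (feature, weight) pairs emitted by all mappers.
    Mirrors `SELECT feature, avg(weight) ... GROUP BY feature`.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in mapper_outputs:
        totals[feature] += weight
        counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical (feature, weight) pairs from two mappers.
mapper_1 = [("price", 0.5), ("size", -1.0)]
mapper_2 = [("price", 1.5), ("rooms", 2.0)]
model = average_models(mapper_1 + mapper_2)
```

A feature seen by only one mapper keeps that mapper's weight, because the average is taken over the models that actually emitted it.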
How to use Hivemall - Training
Training of Confidence Weighted (CW) Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature
Vote to use negative or positive weights for avg
+0.7, +0.3, +0.2, -0.1, +0.7

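The voting idea can be sketched in a few lines of Python. This assumes that the vote is a majority of signs and that only the weights on the winning side are averaged; how zeros and ties are handled here is a choice made for this sketch, not something the slide states.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote.

    Assumptions for this sketch: zeros are ignored and ties go to the
    negative side.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    return sum(negatives) / len(negatives)


# The weights from the slide: four positive votes, one negative vote.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
plain_avg = sum([0.7, 0.3, 0.2, -0.1, 0.7]) / 5
```

For the slide's weights the positive side wins, so `result` is the mean of the four positive weights (about 0.475), whereas the plain mean is pulled down to about 0.36 by the single negative weight.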
Ensemble learning for stable prediction performance
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
) t
group by label, feature;
Just stack prediction models by union all

How to use Hivemall
[Diagram: machine learning workflow. Training: feature vectors and labels produce a prediction model. Prediction: the prediction model maps feature vectors to labels. Current step: Prediction]

How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory

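What the join computes per row can be sketched in Python: look up the weight of each feature of a test row, sum the matches, and apply the sigmoid. The data below is made up, and `predict` is a hypothetical helper that mirrors the query rather than any Hivemall API.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: dict feature -> weight."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # A feature missing from the model adds nothing, like the NULL that
        # a LEFT OUTER JOIN yields and SUM skips. Unlike SQL, a row with no
        # matching feature at all gets sigmoid(0) = 0.5 here instead of NULL.
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


# Hypothetical model and exploded test rows (one feature per record).
model = {"a": 2.0, "b": -2.0}
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(rows, model)
```

In Hive the dictionary lookup is replaced by the join itself, which is why the model never has to fit in the memory of a single task.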
How to use Hivemall
[Diagram: machine learning workflow. Batch Training on Hadoop produces a prediction model; Online Prediction on RDBMS applies it. Current step: Export prediction models]

Online Prediction on MySQL (RDBMS)
Quick (msec) response on an RDBMS by adding an index to the feature column
bit.ly/hivemall-mysql

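A minimal sketch of this pattern, using Python's built-in `sqlite3` as a stand-in for MySQL. The table name, index name, and weights are assumptions made for illustration; the point is only that an index on the feature column turns each prediction into a handful of index lookups.

```python
import math
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (feature TEXT NOT NULL, weight REAL NOT NULL)")
# Index on the feature column so each lookup avoids a full table scan.
conn.execute("CREATE INDEX idx_lr_model_feature ON lr_model (feature)")
conn.executemany(
    "INSERT INTO lr_model (feature, weight) VALUES (?, ?)",
    [("a", 2.0), ("b", -2.0), ("c", 0.5)],  # hypothetical exported weights
)


def predict_online(conn, features):
    """Sum the weights of the given features and apply the sigmoid."""
    placeholders = ",".join("?" for _ in features)
    query = (
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})"
    )
    (total,) = conn.execute(query, list(features)).fetchone()
    return 1.0 / (1.0 + math.exp(-total))


prob = predict_online(conn, ["a", "c", "unseen"])
```

Only the placeholders are interpolated into the SQL text; the feature values themselves are passed as bound parameters.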
Features to be supported in Hivemall v0.4
1. Random Forest
   • classification, regression
   • Based on Smile: github.com/haifengl/smile
2. Factorization Machine
   • classification, regression (factorization)
Planned to release v0.4 in Oct.
Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction)

Random Forest in Hivemall v0.4
Ensemble of Decision Trees
Bagging
Already available on a development (smile) branch; its usage is explained in the project wiki

Training of Random Forest
Out-of-bag tests and Variable Importance

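In bagging, each tree is trained on a bootstrap sample (rows drawn with replacement), and the rows a tree never saw are its out-of-bag rows, which can serve as a test set for that tree. A minimal Python sketch of the sampling step only; it is not the Smile or Hivemall implementation, and the row count and seed are arbitrary.

```python
import random


def bootstrap_sample(n_rows: int, rng: random.Random):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [bootstrap_sample(1000, rng) for _ in range(3)]  # one sample per tree
oob_fractions = [len(oob) / 1000 for _, oob in samples]
```

With many rows, roughly a third of them (about 1/e) end up out of bag for each tree, which is what makes out-of-bag testing possible without a separate hold-out set.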
Prediction of Random Forest

Random Forest
DEMO
http://bit.ly/hivemall-rf

Factorization Machine
Matrix Factorization

Factorization Machine
Context information (e.g., time) can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Factorization Machine
Factorization Model with degree=2 (2-way interaction)
• Global Bias
• Regression coefficient of j-th variable
• Pairwise Interaction
• Factorization

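For reference, the degree-2 model equation from the Rendle (2010) paper cited in this deck can be written as follows. The symbols follow the usual presentation of that paper and are assumptions here, with the terms annotated using the slide's labels.

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficients}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interactions}},
\qquad
\langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle = \sum_{f=1}^{k} v_{j,f} \, v_{j',f}
```

The factorization is the last part: instead of learning one free weight per pair of variables, each variable j gets a k-dimensional vector v_j, and the interaction weight of a pair is the inner product of the two vectors.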
Factorization Machine
Factorization Machine ≈ Polynomial Regression + Factorization
For a feature [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
bit.ly/hivemall-poly

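The expansion above can be reproduced with a short Python sketch. `polynomial_features` is a hypothetical helper written for this illustration, not Hivemall's polynomial expansion function.

```python
from itertools import combinations_with_replacement


def polynomial_features(values, degree=2):
    """Return all monomials of the inputs up to `degree`, including the constant 1."""
    features = []
    for d in range(degree + 1):
        # Every multiset of d inputs yields one monomial; d = 0 gives the constant.
        for combo in combinations_with_replacement(values, d):
            term = 1.0
            for v in combo:
                term *= v
            features.append(term)
    return features


a, b = 2.0, 3.0
expanded = polynomial_features([a, b])  # [1, a, b, a^2, ab, b^2]
```

With n input features the degree-2 expansion has on the order of n^2 terms, which is the cost that the factorized interaction weights of a Factorization Machine avoid.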
Factorization Machine
DEMO

Features to be supported in Hivemall v0.4.1
1. Gradient Tree Boosting
   • classifier, regression
2. Field-aware Factorization Machine
   • classification, regression (factorization)
   • The existing implementation, i.e., LibFFM, can only be applied to classification
Planned to release v0.4.1 in Nov/Dec.

Gradient Tree Boosting (or Gradient Boosting Trees)
RF ≈ Bagging + Decision Trees: parallel execution of decision trees
GBT ≈ Boosting + Decision Trees: sequential execution of decision trees

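The reason boosting is sequential is visible in a small sketch: each new tree is fit to the residuals of the ensemble built so far, so it cannot be trained before the previous trees exist. The Python below is a toy version for squared loss with one-split trees on 1-D data. It is not the planned Hivemall implementation, and the data, tree count, and learning rate are made up.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold) by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


# Hypothetical 1-D regression data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

In a random forest the loop body would not depend on `predictions`, which is why its trees can be trained in parallel.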
Gradient Tree Boosting

Features to be supported in Hivemall v0.4.2
1. Online LDA
   • topic modeling, clustering
2. Mix server on Apache YARN
   • Service for parameter sharing among workers
   • working w/ @maropu
Planned to release v0.4.2 in Dec/Jan.
External service to share parameters by distributed training processes in the middle of training

What’s MixServer?
[Diagram: classifiers exchange model updates with Mix servers]
• Model updates
• Async add
• Piggyback if…
• AVG/Argmin KLD accumulator
• hash(feature) % N
• Non-blocking Channel (single shared TCP connection w/ TCP keepalive)
• Computation/training is not being blocked
Taking advantage of asynchronous non-blocking I/O is the core idea behind Hivemall’s MIX protocol

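The `hash(feature) % N` label suggests that features are partitioned across the N Mix servers, so that every worker sends updates for a given feature to the same server. A minimal Python sketch of such routing follows; the CRC32 hash and the feature names are assumptions made for illustration, since the slide does not say which hash function is used.

```python
import zlib


def mix_server_for(feature: str, servers):
    """Pick the server responsible for a feature via hash(feature) % N."""
    # Assumption: CRC32 stands in for whatever hash function Hivemall uses.
    index = zlib.crc32(feature.encode("utf-8")) % len(servers)
    return servers[index]


servers = ["host01", "host02", "host03"]  # example host names
assignments = {f: mix_server_for(f, servers) for f in ["price", "size", "rooms"]}
```

Because the mapping depends only on the feature and the server list, workers need no coordination to agree on where a feature's accumulator lives.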
How to use MixServer
create table kdd10a_pa1_model1 as
select
  feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_pa1(addBias(features), label, "-mix host01,host02,host03") as (feature, weight)
  from
    kdd10a_train_x3
) t
group by feature;

Conclusion and Takeaway
New features in v0.4
• Random Forest
• Factorization Machine
More will follow in v0.4.1
Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
The latest version of Hivemall is available on Treasure Data and used by several companies including OISIX, Livesense, Scaleout, and Freakout.

Beyond Query-as-a-Service!
We Open-source! We invented..
We are hiring machine learning engineers!

Additional slides

Recommendation
Rating prediction of a Matrix
Can be applied for user/Item Recommendation

Matrix Factorization
Factorize a matrix into a product of matrices having k-latent factor

Matrix Factorization
Criteria of Biased MF
• Mean Rating
• Bias for each user/item
• Factorization
• Regularization

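For reference, a common formulation of the biased matrix factorization criterion is sketched below, with one term for each label listed above. The notation is an assumption and is not taken from the slide.

```latex
\min_{P,\,Q,\,b} \sum_{(u,i) \in \mathcal{K}}
  \bigl( r_{ui} - \mu - b_u - b_i - \mathbf{p}_u^{\top} \mathbf{q}_i \bigr)^2
  + \lambda \bigl( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2
  + b_u^2 + b_i^2 \bigr)
```

Here K is the set of observed ratings r_ui, μ is the mean rating, b_u and b_i are the user and item biases, p_u and q_i are the k-dimensional latent factor vectors whose inner product is the factorization term, and λ is the regularization strength.
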
Training of Matrix Factorization
Support iterative training using local disk cache

Prediction of Matrix Factorization

Comparison to Spark MLlib
• Algorithm is different
  Spark: ALS-WR (considers regularization)
  Hivemall: Biased-MF (considers regularization and biases)
• Usability
  Spark: 100+ lines of Scala coding
  Hivemall: SQL (would be easier to use)
• Prediction Accuracy
  Almost the same for the MovieLens 10M dataset

Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid features
1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2 ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3 ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Anomaly detection runs on a series of SQL queries

Anomalies in Sensor Data
Source: https://codeiq.jp/q/207

Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors
Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor

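The density comparison can be made concrete with a brute-force Python sketch of the standard LOF definition. It is a simplified illustration (exactly k neighbors per point, no duplicate points) on made-up 2-D data, not the SQL-based implementation shown in the demo.

```python
import math


def local_outlier_factors(points, k=2):
    """Compute the LOF score of each point (brute force, O(n^2)).

    Assumptions for this sketch: no duplicate points, and ties at the
    k-th distance are broken by index instead of widening the neighborhood.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to its neighbor j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factors(points, k=2)
```

Scores close to 1 mean a point is about as dense as its neighbors; the far-away point gets a score well above 1 because its neighbors sit in a much denser region than it does.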
DEMO: Local Outlier Factor