Page 1
© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect @ Cloudera* @MartonBalassi | [email protected]
Judit Feher | Data Scientist @ [email protected]
* Work carried out while employed by MTA SZTAKI on the Streamline project.
Streaming ML with Flink
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.
Page 2
Outline
• Current FlinkML API through an example
• Adding streaming predictors
• Online learning
• Use cases in the Streamline project
• Summary
Page 3
FlinkML example usage
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)
val model = ALS().setNumFactors(numFactors).setIterations(iterations).setLambda(lambda)
model.fit(trainData)
val prediction = model.predict(testData)
prediction.print()
Given a historical (training) dataset of user preferences, let us recommend desirable items for a set of users.
Design motivated by the scikit-learn API. More at http://arxiv.org/abs/1309.0238.
Page 4
FlinkML example usage
val env = ExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)
val model = ALS().setNumFactors(numFactors).setIterations(iterations).setLambda(lambda)
model.fit(trainData)
val prediction = model.predict(testData)
prediction.print()
Given a historical (training) dataset of user preferences, let us recommend desirable items for a set of users.
The test data is a batch input. But does it need to be?
Page 5
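A little recommender theory
[Figure: the user-item matrix R (users U by items I) approximated by the user factors P and the item factors Q, with user side information and item side information attached.]
• R is potentially huge, approximate it with P ∗ Q
• Prediction is TopK(user’s row ∗ Q)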
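The TopK(user’s row ∗ Q) step can be sketched in plain Scala as follows. This is a minimal illustration, not FlinkML code: the names `TopKSketch`, `dot` and `topK` are made up for this example, and the item factors are assumed to fit in an in-memory `Map`.

```scala
object TopKSketch {
  type Vec = Array[Double]

  /** Dot product of two factor vectors of the same rank. */
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    a.indices.map(i => a(i) * b(i)).sum
  }

  /** Scores every item for one user (the user's row of P times Q) and keeps the k best. */
  def topK(userFactors: Vec, itemFactors: Map[Int, Vec], k: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .map { case (item, q) => (item, dot(userFactors, q)) }
      .sortBy { case (_, score) => -score } // highest score first
      .take(k)
}
```

Only one row of P and the matrix Q are needed per request, which is why the full matrix R never has to be materialized at prediction time.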
Page 6
Prediction is a natural fit for streaming.
Page 7
A closer (schematic) look at the API
trait PredictDataSetOperation[Self, Testing, Prediction] {
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}
trait PredictOperation[Instance, Model, Testing, Prediction] {
  def getModel(instance: Instance): DataSet[Model]
  // Record level: one test value and one model in, one prediction out
  def predict(value: Testing, model: Model): Prediction
}
A DataSet-level and a record-level API to implement the algorithm (prediction is always done on a model already trained)
The record-level version is arguably more convenient
It is wrapped into a default DataSet-level implementation
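The wrapping of a record-level operation into a collection-level default can be sketched without Flink as below. This is an assumption-laden illustration rather than the FlinkML source: `Seq` stands in for `DataSet`, and `PredictSeqOperation`, `defaultPredictSeq`, `Slope` and `predictAll` are hypothetical names introduced here.

```scala
object PredictWrapperSketch {
  // Record-level operation: how to score a single value with a given model.
  trait PredictOperation[Model, Testing, Prediction] {
    def predict(value: Testing, model: Model): Prediction
  }

  // Collection-level operation; Seq stands in for Flink's DataSet here.
  trait PredictSeqOperation[Model, Testing, Prediction] {
    def predictSeq(model: Model, input: Seq[Testing]): Seq[(Testing, Prediction)]
  }

  // Default wrapper: any record-level operation yields a collection-level one.
  implicit def defaultPredictSeq[M, T, P](
      implicit op: PredictOperation[M, T, P]): PredictSeqOperation[M, T, P] =
    new PredictSeqOperation[M, T, P] {
      def predictSeq(model: M, input: Seq[T]): Seq[(T, P)] =
        input.map(v => (v, op.predict(v, model)))
    }

  // Hypothetical toy model: a linear scorer y = slope * x.
  final case class Slope(value: Double)

  implicit val slopePredict: PredictOperation[Slope, Double, Double] =
    new PredictOperation[Slope, Double, Double] {
      def predict(x: Double, model: Slope): Double = model.value * x
    }

  // Entry point that only asks for the collection-level operation.
  def predictAll[M, T, P](model: M, input: Seq[T])(
      implicit op: PredictSeqOperation[M, T, P]): Seq[(T, P)] =
    op.predictSeq(model, input)
}
```

An algorithm author only writes the one-line `predict`; the implicit wrapper supplies the rest, which is the convenience the slide refers to.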
Page 8
A closer (schematic) look at the API
trait Estimator {
  def fit[Training](training: DataSet[Training])(implicit f: FitOperation[Training]) = {
    f.fit(training)
  }
}
trait Transformer extends Estimator {
  def transform[I,O](input: DataSet[I])(implicit t: TransformDataSetOperation[I,O]) = {
    t.transform(input)
  }
}
trait Predictor extends Estimator {
  def predict[Testing](testing: DataSet[Testing])(implicit p: PredictDataSetOperation[Testing]) = {
    p.predict(testing)
  }
}
Three well-picked traits go a long way
Page 9
Could we share the model with a streaming job?
Page 10
Learn in batch, predict in streaming
val env = ExecutionEnvironment.getExecutionEnvironment
val strEnv = StreamExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
// The test data is now a stream, so it comes from the streaming environment
val testData = strEnv.socketTextStream(...).map(_.toInt)
val model = ALS().setNumFactors(numFactors).setIterations(iterations).setLambda(lambda)
model.fit(trainData)
val prediction = model.predictStream(testData)
prediction.print()
Page 11
A closer (schematic) look at the streaming API
trait PredictDataSetOperation[Self, Testing, Prediction] {
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}
trait PredictDataStreamOperation[Self, Testing, Prediction] {
  def predictDataStream(instance: Self, input: DataStream[Testing]): DataStream[Prediction]
}
• Implicit conversions from the batch Predictors to StreamPredictors
• The model is stored, then loaded into a stateful RichMapFunction processing the input stream
• Default wrapper implementations to support both the DataStream-level and the record-level implementations
• Adding the streaming predictor implementation for an algorithm given the batch one is trivial
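The "load the model once, then map every record" pattern can be sketched in plain Scala as below. This is not the Streamline implementation and it does not use Flink's classes: `ModelMapper` only mimics the `open()`/`map()` lifecycle of a stateful map function, and `Model`, `loadModel` and `loads` are hypothetical names for this example.

```scala
object StreamPredictSketch {
  // Stand-in for a trained model: item factor vectors keyed by item id.
  final case class Model(itemFactors: Map[Int, Array[Double]])

  // Mimics the open()/map() lifecycle of a stateful map function:
  // the stored model is loaded once, then reused for every incoming record.
  class ModelMapper(loadModel: () => Model, userFactors: Map[Int, Array[Double]]) {
    private var model: Option[Model] = None
    var loads = 0 // counts how often the stored model was read

    def open(): Unit = {
      model = Some(loadModel())
      loads += 1
    }

    // Returns the best-scoring item for the user, or None for unknown users.
    def map(user: Int): Option[Int] = {
      if (model.isEmpty) open()
      for {
        p <- userFactors.get(user)
        m <- model
        if m.itemFactors.nonEmpty
      } yield m.itemFactors.maxBy { case (_, q) =>
        p.zip(q).map { case (a, b) => a * b }.sum
      }._1
    }
  }
}
```

Because the per-record work only reads the model, the same scoring logic serves both a batch and a streaming input, which is why adding the streaming variant is cheap once the batch predictor exists.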
Page 12
Recommender systems in batch vs online learning
• “30M” music listening dataset crawled by the CrowdRec team
• Implicit, timestamped music listening dataset
• Each record contains: [timestamp, user, artist, album, track, …]
• We always recommend and learn when the user interacts with an item for the first time
• ~50,000 users, ~100,000 artists, ~500,000 tracks
• This happens when we shuffle the time
• A partially batch online system
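The "recommend first, then learn" protocol on first-time interactions can be sketched as follows. This is only an illustration under stated assumptions: a simple popularity counter stands in for the actual recommender model, and `Event`, `run` and the hit counting are hypothetical names and choices made for this example.

```scala
object OnlineEvalSketch {
  final case class Event(timestamp: Long, user: Int, item: Int)

  /** Processes events in time order; returns (hits, evaluated events). */
  def run(events: Seq[Event], k: Int): (Int, Int) = {
    val popularity = scala.collection.mutable.Map.empty[Int, Int]
    val seen = scala.collection.mutable.Set.empty[(Int, Int)]
    var hits = 0
    var evaluated = 0
    for (e <- events.sortBy(_.timestamp)) {
      // Only the first interaction of a user with an item is used.
      if (seen.add((e.user, e.item))) {
        // 1) Recommend with the model as it was before this event.
        val topK = popularity.toSeq
          .sortBy { case (item, n) => (-n, item) }
          .take(k)
          .map(_._1)
        if (topK.contains(e.item)) hits += 1
        evaluated += 1
        // 2) Only then learn from the event.
        popularity(e.item) = popularity.getOrElse(e.item, 0) + 1
      }
    }
    (hits, evaluated)
  }
}
```

Sorting by timestamp matters here: evaluating in time order means the model never sees the future, whereas shuffling the timestamps would leak later events into earlier predictions.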
Page 13
Use cases in the Streamline project
Judit Fehér, Hungarian Academy of Sciences
Page 14
How iALS works and why it is different from ALS
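ALS problem to solve: $R_i = P^T Q_i$ (a linear regression)
Error function
$L = \lVert R - \hat{R} \rVert_{Frob}^2 + \lambda_P \lVert P \rVert_{Frob}^2 + \lambda_Q \lVert Q \rVert_{Frob}^2$
Implicit error function (the sums run over all $N_U$ users and $N_I$ items)
$L = \sum_{u=1,i=1}^{N_U,N_I} w_{u,i} (\hat{r}_{u,i} - r_{u,i})^2 + \lambda_P \sum_{u=1}^{N_U} \lVert P_u \rVert^2 + \lambda_Q \sum_{i=1}^{N_I} \lVert Q_i \rVert^2$
• Weighted MSE
• $w_{u,i} = \begin{cases} w_{u,i} & \text{if } (u,i) \in T \\ w_0 & \text{otherwise} \end{cases}$, with $w_0 \ll w_{u,i}$
• Typical weights: $w_0 = 1$, $w_{u,i} = 100 \cdot supp(u,i)$
• What does it mean?
– Create two matrices from the events
– (1) Preference matrix
  • Binary
  • 1 represents the presence of an event
– (2) Confidence matrix
  • Interprets our certainty on the corresponding values in the first matrix
  • Negative feedback is much less certain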
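The preference and confidence matrices and the weighted error can be sketched in plain Scala as below. This is an illustrative reading of the slide, not the iALS implementation: it assumes supp(u, i) is the number of events observed for the pair, uses the typical weights quoted above, and the names `ImplicitLossSketch`, `weight`, `preference` and `loss` are made up here.

```scala
object ImplicitLossSketch {
  /** Confidence weight: w0 for unobserved pairs, 100 * supp(u, i) for observed ones. */
  def weight(support: Int, w0: Double = 1.0): Double =
    if (support > 0) 100.0 * support else w0

  /** Binary preference: 1 if at least one event was observed, else 0. */
  def preference(support: Int): Double = if (support > 0) 1.0 else 0.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Weighted squared error over all (u, i) pairs plus L2 regularization. */
  def loss(p: Array[Array[Double]], q: Array[Array[Double]],
           support: Map[(Int, Int), Int],
           lambdaP: Double, lambdaQ: Double): Double = {
    val fit = (for (u <- p.indices; i <- q.indices) yield {
      val s = support.getOrElse((u, i), 0)
      val err = dot(p(u), q(i)) - preference(s)
      weight(s) * err * err
    }).sum
    val regP = lambdaP * p.map(row => dot(row, row)).sum
    val regQ = lambdaQ * q.map(row => dot(row, row)).sum
    fit + regP + regQ
  }
}
```

Note that the sum runs over every user-item pair, including the unobserved ones: missing events count as weak negative feedback instead of being ignored, which is the key difference from explicit ALS.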
Page 15
Machine learning: batch, streaming? Combined?
Streaming recommender
• Online learning
• Update immediately, e.g. with large learning rate
• Data streaming
• Read training/testing data only once, no chance to store
• Real time / Interactive
+ More timely, adapts fast
- Challenging to implement
Batch recommender
• Repeatedly read all training data multiple times
• Stochastic gradient: use the data multiple times in random order
• Elaborate optimization procedures, e.g. SVM
+ More accurate (?)
+ Easy to implement (?)
Page 16
Contextualized recommendation (NMusic)
Social recommendation | Geo recommendation
R. Palovics, A. A. Benczur, L. Kocsis, T. Kiss, E. Frigo. "Exploiting temporal influence in online recommendation", ACM RecSys (2014)
Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. "Location-Aware Online Learning for Top-k Hashtag Recommendation", LocalRec (2015)
Page 17
Internet Memory Research use cases
Identify events that influence consumer behavior (product purchases, media consumption)
• Events influence people
  – Before a football match, people buy beer, chips, …
• Specific events influence specific people (requires user profiles)
  – A football fan does not play Angry Birds during a football match
• Annotation by logistic regression
  – Train over data at rest
  – Streaming prediction at crawl time
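The "train over data at rest, predict at crawl time" split can be sketched with a tiny logistic regression in plain Scala. This is a hypothetical illustration, not the project's annotator: the feature layout (a constant bias feature first), the plain SGD training loop, and the names `LogRegSketch`, `train` and `annotate` are all assumptions made for this example.

```scala
object LogRegSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Predicted probability of the positive class for feature vector x. */
  def score(w: Array[Double], x: Array[Double]): Double =
    sigmoid(w.zip(x).map { case (a, b) => a * b }.sum)

  /** Batch phase: several SGD passes over labelled data at rest. */
  def train(data: Seq[(Array[Double], Double)], epochs: Int, rate: Double): Array[Double] = {
    require(data.nonEmpty, "training data must not be empty")
    val w = Array.fill(data.head._1.length)(0.0)
    for (_ <- 1 to epochs; sample <- data) {
      val (x, y) = sample
      val g = score(w, x) - y // gradient of the log loss w.r.t. the logit
      for (j <- w.indices) w(j) -= rate * g * x(j)
    }
    w
  }

  /** Streaming phase: annotate one record at a time with the frozen weights. */
  def annotate(w: Array[Double])(x: Array[Double]): Boolean = score(w, x) >= 0.5
}
```

Once `train` has produced the weights, `annotate(w)` is a pure per-record function, so it can be applied to each crawled document as it arrives.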
Page 18
Portugal Telecom use cases
MEO quadruple-play
Features
• Internet
• TV (IPTV)
• Mobile phone
• Landline phone
Current challenges
• Heterogeneous data
• Heterogeneous technical solutions
• Customer profiling
• Cross-domain recommendation
• 1 TB/day
Page 20
Development at SZTAKI
iALS
- Flink already has explicit ALS
- The implementation of the implicit version is done
- Currently testing the algorithm's accuracy
Matrix factorization
- Distributed algorithm*
- We have a working prototype tested on smaller matrices, but it still needs optimization
Logistic regression
- Implementation in progress
- It is based on stochastic gradient descent, but in Flink there is only a batch version
- Currently working on the gradient descent implementation
Metrics
- Implementation and testing are finished
- We need to create a pull request
* R. Gemulla et al., “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent”, KDD 2011.
Page 21
Summary
• Scala is a great tool for building DSLs
• FlinkML’s API is motivated by scikit-learn
• Streaming is a natural fit for ML predictors
• Online learning can outperform batch in certain cases
• The Streamline project builds on Flink and aims to contribute back as much of the results as possible