Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
Zoltán Zvara, Gábor Hermann
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.
About us
• Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
• Informatics Laboratory
• "Big Data – Momentum" research group
• "Data Mining and Search" research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.
Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
   1. Online
   2. Batch + online
4. Solution in Spark & Flink
5. Conclusions
Recommendation systems
Recommendation with matrix factorization
[Figure: rating matrix R of users (Zoltán, Gábor) and movies (Rogue One, Interstellar); the cells hold star ratings.]
Zoltán rated Rogue One with 5 stars
• U ∙ I ≈ R: the rating matrix R is approximated by the product of a user matrix U and an item matrix I
• Each user has a user vector, for example (5, −6, −1) and (5, 4, −4)
• Each item has an item vector, for example (3, 2, 5) and (5, 3, 2)
• Latent factors: level of action, level of drama, X factor
Objective:
$$\min_{u_*, i_*} \sum_{(p,q) \in \kappa_R} \left(r_{pq} - \mu - b_p - b_q - u_p i_q\right)^2 + \lambda \sum_{p \in \kappa_U} \left(\lVert u_p \rVert^2 + b_p^2\right) + \lambda \sum_{q \in \kappa_I} \left(\lVert i_q \rVert^2 + b_q^2\right)$$
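A minimal Scala sketch of how the objective on this slide could be evaluated for a given model. All names (`Rating`, `objective`, the map-based model layout) are hypothetical and not taken from the talk. The reading of μ as a global mean and of b_p, b_q as user and item biases is an assumption based on the usual notation; the slide does not define them.

```scala
// Hypothetical types and names; only the formula itself comes from the slide.
final case class Rating(user: Int, item: Int, value: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def sqNorm(a: Array[Double]): Double = dot(a, a)

// Squared error over the known ratings plus L2 penalties on vectors and biases.
def objective(
    ratings: Seq[Rating],
    mu: Double,                       // assumed: global rating mean
    userBias: Map[Int, Double],       // assumed: b_p
    itemBias: Map[Int, Double],       // assumed: b_q
    userVec: Map[Int, Array[Double]], // u_p
    itemVec: Map[Int, Array[Double]], // i_q
    lambda: Double): Double = {
  val squaredError = ratings.iterator.map { r =>
    val e = r.value - mu - userBias(r.user) - itemBias(r.item) -
      dot(userVec(r.user), itemVec(r.item))
    e * e
  }.sum
  val userPenalty = userVec.iterator.map { case (p, u) =>
    sqNorm(u) + userBias(p) * userBias(p)
  }.sum
  val itemPenalty = itemVec.iterator.map { case (q, i) =>
    sqNorm(i) + itemBias(q) * itemBias(q)
  }.sum
  squaredError + lambda * userPenalty + lambda * itemPenalty
}
```

With λ = 0 only the squared error remains; a larger λ penalizes large vectors and biases more strongly.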
Would Gábor like Interstellar?
• Gábor's user vector: (5, 4, −4)
• Interstellar's item vector: (3, 2, 5)
• Predicted rating: 5·3 + 4·2 + (−4)·5 = 3
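The prediction above is a dot product of a user vector and an item vector. A small Scala sketch, using the numbers from the slide; the function and value names are hypothetical.

```scala
// Predicted rating = dot product of the user's and the item's latent vectors.
def predict(userVector: Array[Double], itemVector: Array[Double]): Double = {
  require(userVector.length == itemVector.length,
    "vectors must have the same number of latent factors")
  userVector.zip(itemVector).map { case (u, i) => u * i }.sum
}

// Values from the slide: Gábor's user vector and Interstellar's item vector.
val gabor = Array(5.0, 4.0, -4.0)
val interstellar = Array(3.0, 2.0, 5.0)
val predicted = predict(gabor, interstellar) // 15 + 8 - 20 = 3.0
```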
Batch training
• Input records: [user; item; time; rating]
• The records are read from persistent storage
• The user vectors and item vectors are computed from all stored ratings
Online training
• Input: a stream of [user; item; time; rating] records (ratings in the example: 2, 5, 4, 2, 4)
• Each incoming rating immediately updates the corresponding user vector and item vector
• In the example, one update changes the item vector from (3, 2, 6) to (1, 3, 5) and the user vector from (5, −6, −2) to (4, −5, −1)
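One way to picture a single online update is a stochastic gradient descent (SGD) step on the squared error of the incoming rating. The sketch below is an illustration under assumptions, not the speakers' code: it leaves out the bias terms, and `sgdStep`, `learningRate` and `lambda` are hypothetical names. It will not reproduce the integer vectors shown on the slides.

```scala
// One SGD step for a single (user, item, rating) observation.
// Simplification: no bias terms, plain L2 regularization.
def sgdStep(
    userVector: Array[Double],
    itemVector: Array[Double],
    rating: Double,
    learningRate: Double,
    lambda: Double): (Array[Double], Array[Double]) = {
  val prediction = userVector.zip(itemVector).map { case (u, i) => u * i }.sum
  val error = rating - prediction
  // Both updates use the old vectors, so they are computed before either is replaced.
  val newUser = userVector.zip(itemVector).map { case (u, i) =>
    u + learningRate * (error * i - lambda * u)
  }
  val newItem = userVector.zip(itemVector).map { case (u, i) =>
    i + learningRate * (error * u - lambda * i)
  }
  (newUser, newItem)
}
```

Only the two vectors that belong to the incoming rating change, which is why an online update is cheap compared with retraining on all stored ratings.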
Batch + online combination
But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube: over a billion users, billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?
Apache Spark vs. Apache Flink
Distributed online matrix factorization
• Input: a stream of [user; item; time; rating] records
• For each rating, the user vector and the item vector need to be co-located
• Then both vectors are updated
• Finally, the updates are sent out
• Process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
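The concurrent-modification problem can be avoided by only running ratings in parallel that share neither a user nor an item. A rough Scala sketch of this stratification idea from Distributed SGD follows; it is an illustration, not the speakers' Spark or Flink code. `Event`, `stratumOf`, `toStrata` and the modulo-based block assignment are hypothetical choices.

```scala
// Hypothetical record type for one rating event.
final case class Event(user: Int, item: Int, rating: Double)

// Users and items are split into numBlocks blocks each (here simply by modulo).
// Stratum s contains the blocks (userBlock b, itemBlock (b + s) mod numBlocks).
def stratumOf(e: Event, numBlocks: Int): Int = {
  val userBlock = Math.floorMod(e.user, numBlocks)
  val itemBlock = Math.floorMod(e.item, numBlocks)
  Math.floorMod(itemBlock - userBlock, numBlocks)
}

// Group events by stratum, then by user block within the stratum.
// Blocks inside one stratum touch disjoint users and disjoint items,
// so they can be processed in parallel without concurrent modification.
def toStrata(events: Seq[Event], numBlocks: Int): Map[Int, Map[Int, Seq[Event]]] =
  events.groupBy(stratumOf(_, numBlocks)).map { case (s, es) =>
    s -> es.groupBy(e => Math.floorMod(e.user, numBlocks))
  }
```

Strata are processed one after another, while the blocks within a stratum are processed at the same time; in practice `numBlocks` would typically match the parallelism of the job.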
Online MF in Spark
// we have our input
val ratings: DStream[Rating] = ...
// would like to have output like this
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
• updateStateByKey?
• Use batch DSGD for online updates! (discussion issue SPARK-6407)
Batch + online combination
• Last.fm dataset with 30M music listening events
• Weekly batch training
• Evaluation: weekly average
  • on every incoming listening event
• Around 45,000 users
Online MF: Spark vs. Flink
• Last.fm dataset with 30M music listening events, read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm