7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891
1/67
Flink vs. Spark
Slim Baltagi @SlimBaltagi
Director of Big Data Engineering, Fellow
Capital One
https://twitter.com/SlimBaltagi

Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?

I. Motivation for this talk
1. Marketing fluff
2. Confusing statements
3. Burning questions & incorrect or outdated answers
4. Helping others evaluate Flink vs. Spark

1. Marketing fluff

2. Confusing statements
"Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason."
Source: MapReduce and Spark, Mike Olson, Chief Strategy Officer, Cloudera. December 30th, 2013. http://vision.cloudera.com/mapreduce-spark/
"Goal: one engine for all data sources, workloads and environments."
Source: Slide 15 of "New Directions for Apache Spark in 2015", Matei Zaharia, CTO, Databricks. February 20th, 2015. http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Burning questions & incorrect or outdated answers
"Projects that depend on smart optimizers rarely work well in real life." Curt Monash, Monash Research. January 2015. http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.html
"Flink is basically a Spark alternative out of Germany, which I've been dismissing as unneeded." Curt Monash, Monash Research, March 5, 2015. http://www.dbms2.com/2015/03/05/cask-and-cdap/
"Of course, this is all a bullish argument for Spark (or Flink, if I'm wrong to dismiss its chances as a Spark competitor)." Curt Monash, Monash Research, September 28, 2015. http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/

3. Burning questions & incorrect or outdated answers
"The benefit of Spark's micro-batch model is that you get full fault-tolerance and exactly-once processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this..." Matei Zaharia, CTO, Databricks. May 2015. http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
"I understand Spark Streaming uses micro-batching. Does this increase latency? While Spark does use a micro-batch execution model, this does not have much impact on applications..." http://spark.apache.org/faq.html

4. Helping others evaluate Flink vs. Spark
Besides the marketing fluff, the confusing statements, and the incorrect or outdated answers to burning questions, the little information on the subject of Flink vs. Spark is available piecemeal!
While evaluating different stream processing tools at Capital One, we built a framework listing categories and over 100 criteria to assess these stream processing tools.
In the next section, I'll share this framework and use it to compare Spark and Flink on a few key criteria.
We hope this will be beneficial to you as well when selecting Flink and/or Spark for stream processing.

Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?

II. Apache Flink vs. Apache Spark?
1. What is Apache Flink?
2. What is Apache Spark?
3. Framework to evaluate Flink and Spark
4. Flink vs. Spark on a few key criteria
5. Future work

1. What is Apache Flink?
Squirrel: Animal. In harmony with other animals in the Hadoop ecosystem (Zoo): elephant, pig, python, camel, ...
Squirrel: reflects the meaning of the word "Flink"

1. What is Apache Flink?
"Apache Flink is an open source platform for distributed stream and batch data processing." https://flink.apache.org/
See also the definition in Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink

2. What is Apache Spark?
"Apache Spark™ is a fast and general engine for large-scale data processing." http://spark.apache.org/
See also the definition in Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
The logo was picked to reflect lightning-fast cluster computing.

3. Framework to evaluate Flink and Spark
1. Background
2. Fit-for-purpose Categories
3. Organizational-fit Categories
4. Miscellaneous/Other Categories

1. Background
1.1 Definition
1.2 Origin
1.3 Maturity
1.4 Version
1.5 Governance model
1.6 License model

2. Fit-for-purpose Categories
2.1 Security
2.2 Provisioning & Monitoring Capabilities
2.3 Latency & Processing Architecture
2.4 State Management
2.5 Processing Delivery Assurance
2.6 Database Integrations, Native vs. Third party connector
2.7 High Availability & Resiliency
2.8 Ease of Development
2.9 Scalability
2.10 Unique Capabilities/Key Differentiators

2. Fit-for-purpose Categories
2.1 Security
2.1.1 Authentication, Authorization
2.1.2 Data at rest encryption (data persisted in the framework)
2.1.3 Data in motion encryption (producer -> framework -> consumer)
2.1.4 Data in motion encryption (inter-node communication)

2. Fit-for-purpose Categories
2.2 Provisioning & Monitoring Capabilities
2.2.1 Robustness of Administration
2.2.2 Ease of maintenance: Does technology provide configuration, deployment, scaling, monitoring, performance tuning and auditing capabilities?
2.2.3 Monitoring & Alerting
2.2.4 Logging
2.2.5 Audit
2.2.6 Transparent Upgrade: Version upgrade with minimum downtime

2. Fit-for-purpose Categories
2.3 Latency & Processing Architecture
2.3.1 Supports tuple at a time, micro-batch, transactional updates and batch processing
2.3.2 Computational model
2.3.3 Ability to reprocess historical data from source
2.3.4 Ability to reprocess historical data from native engine
2.3.5 Call external source (API/database calls)
2.3.6 Integration with Batch (static) source
2.3.7 Data Types (images, sound etc.)
2.3.8 Supports complex event processing and pattern detection vs. continuous operator model (low latency, flow control)
2.3.9

2. Fit-for-purpose Categories
2.4 State Management
2.4.1 Stateful vs. Stateless
2.4.2 Is stateful data persisted locally vs. external database vs. ephemeral
2.4.3 Native rolling, tumbling and hopping window support
2.4.4 Native support for integrated data store

2. Fit-for-purpose Categories
2.5 Processing Delivery Assurance
2.5.1 Guarantee (At least once)
2.5.2 Guarantee (At most once)
2.5.3 Guarantee (Exactly once)
2.5.4 Global Event order guaranteed
2.5.5 Guarantee predictable and repeatable outcomes (deterministic or not)

2. Fit-for-purpose Categories
2.6 Database Integrations, Native vs. Third party connector
2.6.1 NoSQL database integration
2.6.2 File Format (Avro, Parquet and other format support)
2.6.3 RDBMS integration
2.6.4 In-memory database integration / Caching integration

2. Fit-for-purpose Categories
2.7 High Availability & Resiliency
2.7.1 Can the system avoid slowdown due to straggler node
2.7.2 Fault-Tolerance (does the tool handle node/operator/messaging failures without catastrophically failing)
2.7.3 State recovery from in-memory
2.7.4 State recovery from reliable storage
2.7.5 Overhead of fault tolerance mechanism (Does failure handling introduce additional latency or negatively impact throughput?)
2.7.6 Multi-site support (multi-region)
2.7.7 Flow control: backpressure tolerance from slow operators or consumers
2.7.8 Fast parallel recovery vs. replication or serial recovery on one node at a time

2. Fit-for-purpose Categories
2.8 Ease of Development
2.8.1 SQL Interface
2.8.2 Real-Time debugging option
2.8.3 Built-in stream oriented abstraction (streams, windows, operators, iterators - expressive APIs that enable programmers to quickly develop streaming data applications)
2.8.4 Separation of application logic from fault tolerance
2.8.5 Testing tools and framework
2.8.6 Change management: multiple model deployment (e.g. separate cluster, or can one create multiple independent redundant streams internally)
2.8.7 Dynamic model swapping (Support dynamic updating of operators/topology/DAG without restart or service interruption)
2.8.8 Required knowledge of system internals to develop an application
2.8.9 Time to market for applications
2.8.10 Supports plug-in of external libraries
2.8.11 API

2. Fit-for-purpose Categories
2.9 Scalability
2.9.1 Supports multi-thread across multiple processors/cores
2.9.2 Distributed across multiple machines/servers
2.9.3 Partition Algorithm
2.9.4 Dynamic elasticity - Scaling with minimum impact/performance penalty
2.9.5 Horizontal scaling with linear performance/throughput
2.9.6 Vertical scaling (GPU)
2.9.7 Scaling without downtime

3. Organizational-fit Categories
3.1 Maturity & Community Support
3.2 Support Languages for Development
3.3 Cloud Portability
3.4 Compatibility with Native Hadoop Architecture
3.5 Adoption of Community vs. Enterprise Edition
3.6 Integration with Message Brokers

3. Organizational-fit Categories
3.1 Maturity & Community Support
3.1.1 Open Source Support
3.1.2 Maturity (years)
3.1.3 Stable
3.1.4 Centralized documentation with versioning support
3.1.5 Documentation of programming API with good code examples
3.1.6 Centralized visible roadmap
3.1.7 Community acceptance vs. Vendor driven
3.1.8 Contributors

3. Organizational-fit Categories
3.2 Support Languages for Development
3.2.1 Language technology was built on
3.2.2 Language supported to access technology

3. Organizational-fit Categories
3.3 Cloud Portability
3.3.1 Ease of migration between cloud vendors
3.3.2 Ease of migration between on premise to cloud
3.3.3 Ease of migration from on premise to complete cloud services
3.3.4 Cloud compatibility (AWS, Google, Azure)

3. Organizational-fit Categories
3.4 Compatibility with Native Hadoop Architecture
3.4.1 Implement on top of Hadoop YARN vs. Standalone
3.4.2 Mesos
3.4.3 Coordination with Apache Zookeeper

3. Organizational-fit Categories
3.5 Adoption of Community vs. Enterprise Edition
3.5.1 Open Source
3.5.2 Enterprise Support

4. Miscellaneous/Other Categories
4.1 Best Suited for
4.2 Key use case scenarios
4.3 Companies using technology

4. Flink vs. Spark on a few key criteria
1. Streaming Engine
2. Iterative Processing
3. Memory Management
4. Optimization
5. Configuration
6. Tuning
7. Performance

4.1. Streaming Engine
Many time-critical applications need to process large streams of live data and provide results in real-time. For example:
- Financial Fraud detection
- Financial Stock monitoring
- Anomaly detection
- Traffic management applications
- Patient monitoring
- Online recommenders
Some claim that 95% of streaming use cases can be handled with micro-batches!? Really!!!

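Much of the doubt about micro-batches comes down to buffering delay. The sketch below is a toy model in plain Java, not Spark or Flink code: it assumes an event is held until the end of the batch interval it arrives in and that processing itself takes no time. The class name, the method name and the 1-second interval are illustrative assumptions.

```java
/** Toy model of micro-batch buffering delay; assumes zero processing and scheduling time. */
final class MicroBatchDelaySketch {

    /**
     * Delay between an event's arrival and the close of the batch interval
     * [k * interval, (k + 1) * interval) that contains it.
     */
    static long bufferingDelayMs(long arrivalMs, long batchIntervalMs) {
        long batchEnd = (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
        return batchEnd - arrivalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 1000; // hypothetical 1-second batch interval
        for (long arrival : new long[] {0, 250, 999, 1000}) {
            System.out.println(arrival + " ms -> waits " + bufferingDelayMs(arrival, intervalMs) + " ms");
        }
    }
}
```

Under this model the added delay ranges from almost nothing up to one full interval, so the batch interval sets a floor on worst-case latency no matter how fast the processing is. A record-at-a-time engine has no such buffering step.
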
4.1. Streaming Engine
"Spark's micro-batching isn't good enough!" Ted Dunning, Chief Applications Architect at MapR, talk at the Bay Area Apache Flink Meetup in August 2015: http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/
Ted described several use cases where batch and micro-batch processing is not appropriate and described why. He also described what a true streaming solution needs to provide for solving these problems. These use cases were taken from real industrial situations, but the descriptions drove down to technical details as well.

4.1. Streaming Engine
"I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture, Flink is a perfect match for big data stream processing in the Apache stack." - Volker Markl
Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015. http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch. Batch is just treated as a finite set of streamed data. This makes Flink the most sophisticated distributed open source Big Data processing engine (not the most mature one yet!).

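The idea that batch is just a finite stream can be illustrated without any Flink dependency. The sketch below uses java.util.stream as a stand-in for a streaming runtime, which is an assumption made for illustration only; the class and method names are hypothetical. The same pipeline definition is applied once to a bounded input and once to an unbounded one.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Uses java.util.stream as a stand-in for a streaming runtime; this is not Flink's API. */
final class BoundedStreamSketch {

    /** One pipeline definition that does not care whether its input ever ends. */
    static Stream<Integer> evenTimesTen(Stream<Integer> events) {
        return events.filter(x -> x % 2 == 0).map(x -> x * 10);
    }

    public static void main(String[] args) {
        // "Batch": a finite input, consumed to the end.
        List<Integer> batch = evenTimesTen(Stream.of(1, 2, 3, 4)).collect(Collectors.toList());
        // "Stream": an unbounded input; only the first three results are taken.
        List<Integer> live = evenTimesTen(Stream.iterate(1, x -> x + 1))
                .limit(3)
                .collect(Collectors.toList());
        System.out.println(batch + " " + live);
    }
}
```

The operator logic is written once; only the source decides whether the job terminates.
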
4.2. Iterative Processing
Why Iterations? Many Machine Learning and Graph processing algorithms need iterations! For example:
Machine Learning Algorithms
- Clustering (K-Means, Canopy, ...)
- Gradient descent (Logistic Regression, Matrix Factorization)
Graph Processing Algorithms
- Page-Rank, Line-Rank
- Path algorithms on graphs (shortest paths, centralities, ...)
- Graph communities / dense sub-components
- Inference (Belief propagation)

4.2. Iterative Processing
Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.
Flink executes programs with iterations as cyclic data flows: a data flow program (and all its operators) is scheduled just once.
In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set), and computes the next version of the partial solution.

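The step-function contract described above can be modelled in a few lines of plain Java. This is a sketch of the concept, not Flink's Iterate operator: the loop inside iterate stands in for the feedback edge that the runtime would manage, and the class name, method names and the Newton example are illustrative assumptions.

```java
import java.util.function.UnaryOperator;

/** Plain-Java model of a bulk iteration; not Flink's API. */
final class BulkIterationSketch {

    /** Feeds each partial solution back into the same step function. */
    static <T> T iterate(T initial, UnaryOperator<T> step, int maxIterations) {
        T partial = initial;
        for (int i = 0; i < maxIterations; i++) {
            partial = step.apply(partial); // consumes the whole previous result
        }
        return partial;
    }

    /** One Newton step towards sqrt(2): x -> (x + 2/x) / 2. */
    static double newtonStep(double x) {
        return (x + 2.0 / x) / 2.0;
    }

    public static void main(String[] args) {
        double root = iterate(1.0, BulkIterationSketch::newtonStep, 20);
        System.out.println(root);
    }
}
```

The caller hands over the initial data set, the step function and an iteration bound in a single call, which mirrors the point that the whole iterative program is submitted and scheduled once.
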
4.2. Iterative Processing
Delta iterations run only on the parts of the data that are changing and can significantly speed up many machine learning and graph algorithms, because the work in each iteration decreases as the number of iterations goes on.
Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html

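To make the shrinking-work claim concrete, here is a plain-Java sketch of a delta-style iteration that labels connected components. It is a model of the idea (a solution set updated in place plus a workset of changed elements), not Flink's Delta Iterate operator; the class name, the adjacency-array input and the worksetSizes recorder are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of a delta iteration (solution set + workset); not Flink's API. */
final class DeltaIterationSketch {

    /**
     * Labels each vertex with the smallest vertex id in its connected component.
     * neighbors[v] lists the vertices adjacent to v (undirected graph).
     * The workset size at the start of every superstep is appended to worksetSizes.
     */
    static int[] connectedComponents(int[][] neighbors, List<Integer> worksetSizes) {
        int n = neighbors.length;
        int[] label = new int[n];                // solution set: current label per vertex
        Set<Integer> workset = new HashSet<>();
        for (int v = 0; v < n; v++) {
            label[v] = v;
            workset.add(v);                      // initially every vertex counts as changed
        }
        while (!workset.isEmpty()) {
            worksetSizes.add(workset.size());
            Set<Integer> next = new HashSet<>();
            for (int v : workset) {              // only changed vertices do any work
                for (int u : neighbors[v]) {
                    if (label[v] < label[u]) {
                        label[u] = label[v];     // update the solution set in place
                        next.add(u);             // u changed, so it must notify its neighbors
                    }
                }
            }
            workset = next;
        }
        return label;
    }
}
```

For a graph with edges 0-1, 1-2 and 3-4 plus an isolated vertex 5, the first superstep touches all six vertices, while later supersteps touch only the vertices whose label just changed.
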
4.2. Iterative Processing
[Figure: a Client drives a sequence of Steps, submitting one job per Step]
for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
}
Non-native iterations in Hadoop and Spark are implemented as regular for-loops outside the system.

4.2. Iterative Processing
Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.
Spinning Fast Iterative Data Flows - Ewen et al. 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The Apache Flink model for incremental iterative dataflow processing. Academic paper.
Recap of the paper, June 18, 2015: http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/
Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html

4.3. Memory Management
Question: Spark vs. Flink low memory available?
Question answered on stackoverflow.com: http://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-available
The same question still unanswered on the Apache Spark Mailing List!! http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.html

4.3. Memory Management
Features:
- C++ style memory management inside the JVM
- User data stored in serialized byte arrays in JVM
- Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation.
Advantages:
1. Flink will not throw an OOM exception on you.
2. Reduction of Garbage Collection (GC)
3. Very efficient disk spilling and network transfers
4. No need for runtime tuning
5. More reliable and stable performance

4.3. Memory Management
[Figure: the JVM Heap is divided into Unmanaged memory (user code objects), Managed memory (a pool of memory pages used for sorting, hashing and caching) and Network Buffers (shuffles/broadcasts). A sample record type, public class WC { public String word; public int count; }, is shown next to an empty page.]
Flink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components.

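The combination of a fixed page pool and serialized records can be sketched in plain Java. This is a simplified model, not Flink's memory manager or serializers: the PagePool class, the length-prefixed record layout and all method names are assumptions chosen for illustration, and only the WC record type comes from the slide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

/** Plain-Java model of pooled, serialized record storage; not Flink's memory manager. */
final class ManagedMemorySketch {

    /** The record type shown on the slide. */
    static final class WC {
        final String word;
        final int count;
        WC(String word, int count) { this.word = word; this.count = count; }
    }

    /** Fixed budget of equally sized pages, allocated once up front. */
    static final class PagePool {
        private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
        PagePool(int pages, int pageSize) {
            for (int i = 0; i < pages; i++) free.push(ByteBuffer.allocate(pageSize));
        }
        /** Returns a page, or null when the budget is used up (an engine could spill to disk here). */
        ByteBuffer request() { return free.poll(); }
        void release(ByteBuffer page) { page.clear(); free.push(page); }
        int available() { return free.size(); }
    }

    /** Writes [utf8 length][utf8 bytes][count]; returns false if the record does not fit. */
    static boolean write(ByteBuffer page, WC record) {
        byte[] word = record.word.getBytes(StandardCharsets.UTF_8);
        if (page.remaining() < Integer.BYTES + word.length + Integer.BYTES) return false;
        page.putInt(word.length).put(word).putInt(record.count);
        return true;
    }

    /** Reads the record that starts at the given byte offset of the page. */
    static WC read(ByteBuffer page, int offset) {
        ByteBuffer view = page.duplicate(); // independent position, shared bytes
        view.position(offset);
        byte[] word = new byte[view.getInt()];
        view.get(word);
        return new WC(new String(word, StandardCharsets.UTF_8), view.getInt());
    }
}
```

Because records live as bytes inside a bounded set of pages, a full page is a signal to request another page or spill, rather than a cause of an out-of-memory error, and the garbage collector sees a handful of long-lived buffers instead of one object per record.
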
4.3. Memory Management
Peeking into Apache Flink's Engine Room - by Fabian Hüske, March 13, 2015: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Juggling with Bits and Bytes - by Fabian Hüske, May 11, 2015: https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Memory Management (Batch API) by Stephan Ewen - May 2015: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525
Flink added an Off-Heap option for its memory management component in Flink 0.10: https://issues.apache.org/jira/browse/FLINK-1320

4.3. Memory Management
Compared to Flink, Spark is still behind in custom memory management but is catching up with its project Tungsten for Memory Management and Binary Processing: manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection. April 28, 2015: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
It seems that Spark is adopting something similar to Flink, and the initial Tungsten announcement read almost like Flink documentation!!

4.4 Optimization
Apache Flink comes with an optimizer that is independent of the actual programming interface.
It chooses a fitting execution strategy depending on the inputs and operations.
Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.
This helps you focus on your application logic rather than parallel execution.
Quick introduction to the Optimizer: section of the paper:

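The partition-versus-broadcast decision for a join can be pictured as a comparison of estimated network cost. The sketch below is a toy model in plain Java and not Flink's optimizer: the cost formulas (broadcast cost is input size times parallelism, repartition cost is the sum of both inputs) and every name in it are assumptions made for illustration.

```java
/** Toy cost-based choice of a join shipping strategy; the cost model is an assumption. */
final class JoinStrategySketch {

    enum Shipping { BROADCAST_LEFT, BROADCAST_RIGHT, REPARTITION_BOTH }

    /**
     * Estimates network bytes for each strategy and returns the cheapest one.
     * Broadcasting sends one input to every parallel task; repartitioning
     * sends each input over the network once.
     */
    static Shipping choose(long leftBytes, long rightBytes, int parallelism) {
        long broadcastLeft = leftBytes * parallelism;
        long broadcastRight = rightBytes * parallelism;
        long repartition = leftBytes + rightBytes;
        if (repartition <= broadcastLeft && repartition <= broadcastRight) {
            return Shipping.REPARTITION_BOTH;
        }
        return broadcastLeft <= broadcastRight ? Shipping.BROADCAST_LEFT : Shipping.BROADCAST_RIGHT;
    }

    public static void main(String[] args) {
        // A tiny lookup table joined with a large fact table, 8 parallel tasks.
        System.out.println(choose(1_000L, 1_000_000_000L, 8));
        // Two inputs of similar size.
        System.out.println(choose(1_000_000L, 1_000_000L, 8));
    }
}
```

The same program yields a different strategy when the input sizes change, which is the behaviour the slide attributes to an optimizer that picks the plan from the inputs rather than from hand-written hints.
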
4.4 Optimization
[Figure: one program, three settings (run locally on a data sample on the laptop; run on large files on the cluster; run a month later after the data evolved), each leading to a different plan (Execution Plan A, B, C). Optimizer decisions: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort.]
What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.

4.4 Optimization
In contrast to Flink's built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right.
Spark SQL uses the Catalyst optimizer that supports both rule-based and cost-based optimization.
References:
Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Deep Dive into Spark SQL's Catalyst Optimizer: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

4.5. Configuration
Flink requires no memory thresholds to configure
- Flink manages its own memory
Flink requires no complicated network configurations
- Pipelining engine requires much less memory for data exchange
Flink requires no serializers to be configured
- Flink handles its own type extraction and data representation

4.6. Tuning
According to Mike Olson, Chief Strategy Officer of Cloudera Inc.: "Spark is too knobby - it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change." Reference: http://vision.cloudera.com/one-platform/
Tuning Spark Streaming for Throughput, by Gerard Maas from Virdata. December 22, 2014: http://www.virdata.com/tuning-spark/
Spark Tuning: http://spark.apache.org/docs/latest/tuning.html

4.6. Tuning
[Figure: one program, three settings (run locally on a data sample on the laptop; run on large files on the cluster; run a month later after the data evolved), each leading to a different plan (Execution Plan A, B, C). Optimizer decisions: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort.]
What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.

4.7. Performance
Why does Flink provide better performance?
- Custom memory manager
- Native closed-loop iteration operators make graph and machine learning applications run much faster.
- Role of the built-in automatic optimizer. For example: more efficient join processing.
- Pipelining data to the next operator in Flink is more efficient than in Spark.
See benchmarking results against Flink here: http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87

III. How Flink is used at Capital One?
We started our journey with Apache Flink at Capital One while researching and contrasting stream processing tools in the Hadoop ecosystem, with a particular interest in the ones providing real-time stream processing capabilities and not just micro-batching as in Apache Spark.
While learning more about Apache Flink, we discovered some unique capabilities of Flink which differentiate it from other Big Data analytics tools, not only for Real-Time streaming but also for Batch processing.
We evaluated Apache Flink Real-Time stream processing capabilities in a POC.
Where are we in our Flink journey?
Successful installation of Apache Flink 0.9 in our Pre-Production cluster running on CDH 5.4 with security and High Availability enabled.
Successful installation of Apache Flink 0.9 in a 10-node R&D cluster running HDP.
Successful completion of the Flink POC for real-time stream processing. The POC proved that a proprietary system can be replaced by a combination of tools: Apache Kafka, Apache Flink, Elasticsearch and Kibana, in addition to advanced real-time streaming analytics.
What are the opportunities for using Apache Flink at Capital One?
1. Real-Time streaming analytics
2. Cascading on Flink
3. Flink's MapReduce Compatibility Layer
4. Flink's Storm Compatibility Layer
5. Other Flink libraries (Machine Learning and Graph processing) once they come out of beta.
Cascading on Flink:
The first release of Cascading on Flink was announced recently by Data Artisans and Concurrent. It will be supported in the upcoming Cascading 3.1.
Capital One is the first company verifying this release on real-world Cascading data flows, with a simple configuration switch and no code re-work needed!
This is a good example of doing analytics on bounded data sets (Cascading) using a stream processor (Flink).
Expected advantages are a performance boost and less resource consumption.
Future work is to support
Flink's compatibility layer for Storm:
We can execute existing Storm topologies using Flink as the underlying engine.
We can reuse our application code (bolts and spouts) inside Flink programs.
Flink's libraries (FlinkML for Machine Learning and Gelly for Large scale graph processing) can be used along with Flink's DataStream API and DataSet API for our end to end big data analytics needs.
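Reusing bolts and spouts inside another engine's programs is essentially an adapter pattern: the existing callback-style component is wrapped so that it looks like an ordinary stream transformation. The sketch below illustrates that pattern in plain Python. It is not the Flink Storm compatibility API; `SplitSentenceBolt`, `bolt_as_flat_map`, and `sentence_spout` are hypothetical names invented for this example.

```python
class SplitSentenceBolt:
    """Hypothetical Storm-style bolt: receives one tuple, emits zero or more."""

    def execute(self, tup, emit):
        for word in tup.split():
            emit(word)


def bolt_as_flat_map(bolt):
    """Wrap a bolt so it can be applied to any iterable of records."""

    def flat_map(records):
        for record in records:
            buffer = []
            # Collect whatever the bolt emits for this record.
            bolt.execute(record, buffer.append)
            yield from buffer

    return flat_map


def sentence_spout():
    """Hypothetical spout: a source that yields sentences."""
    yield "flink runs storm topologies"
    yield "bolts and spouts are reused"


# The unchanged bolt now runs as a flat-map step over the spout's output.
words = list(bolt_as_flat_map(SplitSentenceBolt())(sentence_spout()))
print(words)
```

The bolt's own code is untouched; only the thin wrapper knows about the host engine's operator interface, which is what makes a "no code re-work" migration plausible.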
IV. What are some key takeaways?
Neither Flink nor Spark will be the single analytics framework that will solve every Big Data problem!
By design, Spark is not for real-time stream processing, while Flink provides a true low latency streaming engine and an advanced DataStream API for real-time streaming analytics.
Although Spark is ahead in popularity and adoption, Flink is ahead in technology innovation and is growing fast.
It is not always the most innovative tool that gets the largest market share; the Flink community needs to take into account the market dynamics!
Both Spark and Flink will have their sweet spots despite their "Me too syndrome".
Thanks!
To all of you for attending!
To Capital One for giving me the opportunity to meet with the growing Apache Flink family.
To the Apache Flink community for the great spirit of collaboration and help.
2016 will be the year of Apache Flink!
See you at FlinkForward 2016!