Elastic Memory: Bring Elasticity Back to In-Memory Big Data Analytics
May 18, 2015
Byung-Gon Chun, Seoul National University (cmslab.snu.ac.kr)
Joint work with Joo Seong Jeong, Woo-Yeon Lee, Yunseong Lee, Youngseok Yang, Brian Cho
Elastic Big Data Analytics Jobs
• MapReduce/DAG jobs execute on a runtime that supports elastic scale-out execution
• Distinct MapReduce/DAG jobs run together on a shared cluster, thus improving utilization
• New types of in-memory data analytics do not fit this model well
  – The interactive query system does not share resources even when the system is idle
New Types of In-Memory Data Analytics: Interactive Query
[Figure: interactive query system with clients, a query master, and workers]
The Case for Elasticity: Interactive Query
• Scale-out
  – The workers may spill data to disk when they do not have enough memory => expand memory resources so that processing stays in memory
• Scale-in
  – The workers hold on to their resources even while they sit idle during periods without client queries => shrink resources to mitigate the drop in cluster utilization
New Types of In-Memory Data Analytics: Machine Learning
[Figure: iterative computation between a master and workers]
The Case for Elasticity: Machine Learning
• Scale-in
  – The job is communication heavy => shrink the number of machines to reduce communication overhead
• Scale-out
  – The job is computation heavy => allocate more memory on other machines to exploit computation parallelism
Elastic Memory (EM)
• Abstraction that provides “elastic memory” by dynamically expanding and shrinking memory resources and moving memory state
  – Mechanisms for reconfiguring memory resources and state
  – Policies for automating reconfiguration
EM Architecture
[Figure: EM architecture. Components shown: Master, Worker, Container, Computation, State, Metric Tracker, Metric Manager, Policy Engine, State Manager, Container Pool, Resource Manager, Distributed File System. Interactions labeled: Profile, Reconfigure. The legend distinguishes EM components from existing components.]
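State Representation
[Figure: state is stored as typed subsets of units inside containers. Container0 holds Subset<type_B> (B0, B1) and Subset<type_A> (A0, A1, A5, A6). Container1 holds Subset<type_C> (C2, C3) and Subset<type_A> (A2, A3, A4). A0 through A6 are units of UNIT type_A.]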
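The unit/subset/container layout on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Container` class, its `put` and `unit_ids` methods, and the `locate` helper are hypothetical names, not part of EM's actual API, and the placement of units mirrors one reading of the figure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Container:
    """Hypothetical container: one subset of units per data type."""
    name: str
    # type name -> {unit id -> unit value}
    subsets: Dict[str, Dict[int, Any]] = field(default_factory=dict)

    def put(self, type_name: str, unit_id: int, value: Any) -> None:
        """Store a unit in the subset for its type, creating the subset if needed."""
        self.subsets.setdefault(type_name, {})[unit_id] = value

    def unit_ids(self, type_name: str) -> List[int]:
        """Return the sorted ids of the units of one type held here."""
        return sorted(self.subsets.get(type_name, {}))


def locate(containers: List[Container], type_name: str, unit_id: int) -> Optional[str]:
    """Return the name of the container holding the unit, or None."""
    for container in containers:
        if unit_id in container.subsets.get(type_name, {}):
            return container.name
    return None


# Units of type_A are spread over two containers, as in the figure.
c0 = Container("Container0")
c1 = Container("Container1")
for i in (0, 1, 5, 6):
    c0.put("type_A", i, f"A{i}")
for i in (2, 3, 4):
    c1.put("type_A", i, f"A{i}")
for i in (0, 1):
    c0.put("type_B", i, f"B{i}")
for i in (2, 3):
    c1.put("type_C", i, f"C{i}")
```

Keying each subset by type keeps all units of one type in a container addressable as a group, which is what a reconfiguration step would need in order to select units to move.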
Primitives for Reconfiguring State
[Figure: the add, delete, and resize primitives applied to containers holding state]
Primitives for Reconfiguring State
[Figure: Move transfers state between Container0 and Container1; Checkpoint writes the state of Container0 to stable storage]
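The five primitives named on these two slides (add, delete, resize, move, checkpoint) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Cluster` and `Container` classes, the megabyte-based memory budget, the JSON file standing in for stable storage, and the choice to make `delete` refuse a container that still holds state.

```python
import json
import os
import tempfile


class Container:
    """Hypothetical container: a memory budget plus the units it holds."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.memory_mb = memory_mb
        self.state = {}  # unit id -> value


class Cluster:
    """Hypothetical holder of containers exposing the reconfiguration primitives."""

    def __init__(self):
        self.containers = {}

    def add(self, name, memory_mb):
        self.containers[name] = Container(name, memory_mb)
        return self.containers[name]

    def delete(self, name):
        # Assumption: never silently drop state; it must be moved away first.
        if self.containers[name].state:
            raise ValueError(f"{name} still holds state; move it first")
        del self.containers[name]

    def resize(self, name, memory_mb):
        self.containers[name].memory_mb = memory_mb

    def move(self, src, dst, unit_ids):
        source, target = self.containers[src], self.containers[dst]
        for uid in unit_ids:
            target.state[uid] = source.state.pop(uid)

    def checkpoint(self, name, path):
        # A local JSON file stands in for stable storage.
        with open(path, "w") as f:
            json.dump(self.containers[name].state, f)


cluster = Cluster()
cluster.add("Container0", 1024).state.update({"A0": 1.0, "A1": 2.0, "A2": 3.0})
cluster.add("Container1", 1024)
cluster.move("Container0", "Container1", ["A1", "A2"])
cluster.resize("Container0", 512)
path = os.path.join(tempfile.mkdtemp(), "container1.json")
cluster.checkpoint("Container1", path)
```

After the calls above, Container0 holds only A0 within a smaller memory budget, and Container1's state has been written to the checkpoint file.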
Profiling
• Each worker’s metric tracker measures local metrics and sends them to the metric manager
• The metric manager aggregates and processes the received metrics
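The tracker-to-manager flow described above can be sketched as follows. The class and method names (`MetricTracker.report`, `MetricManager.receive`, `avg`, `top`) are hypothetical, and a direct method call stands in for whatever messaging EM actually uses between workers and the master.

```python
from collections import defaultdict
from statistics import mean


class MetricManager:
    """Hypothetical master-side aggregator of the latest metrics per worker."""

    def __init__(self):
        self.latest = defaultdict(dict)  # metric name -> {worker id -> value}

    def receive(self, worker_id, metrics):
        for name, value in metrics.items():
            self.latest[name][worker_id] = value

    def avg(self, name):
        """Average of one metric over all workers that reported it."""
        return mean(self.latest[name].values())

    def top(self, name, n=1):
        """Ids of the n workers with the largest value of one metric."""
        ranked = sorted(self.latest[name].items(), key=lambda kv: kv[1], reverse=True)
        return [worker_id for worker_id, _ in ranked[:n]]


class MetricTracker:
    """Hypothetical worker-side tracker that forwards local measurements."""

    def __init__(self, worker_id, manager):
        self.worker_id = worker_id
        self.manager = manager

    def report(self, **metrics):
        self.manager.receive(self.worker_id, metrics)


manager = MetricManager()
MetricTracker("w0", manager).report(load=0.9, idle_time=0)
MetricTracker("w1", manager).report(load=0.5, idle_time=120)
average_load = manager.avg("load")
idlest = manager.top("idle_time")
```

Aggregates such as `avg` and `top` are the kind of functions that policy conditions and selectors can then be written against.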
Policies
• Policy = Rules
• Rule = Condition, Actions
• Condition = Function(metrics)
• Action
  – Add <ResourceSpec>
  – Delete <SelectFunc>
  – Resize <SelectFunc> <ResourceSpec>
  – Merge <SelectFunc> <n>
  – Split <SelectFunc> <n>
  – Migrate <SelectFunc1> <SelectFunc2>
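The grammar above (a policy is a list of rules, a rule pairs a condition over metrics with actions) maps naturally onto a small data structure. The sketch below is one possible encoding, not EM's implementation: `Rule`, `evaluate`, the tuple form of an action, and the `avg_load` metric name are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Action = Tuple[str, tuple]  # (action name, arguments), e.g. ("Add", ("resource-spec",))


@dataclass
class Rule:
    """Rule = Condition, Actions; the condition is a function of the metrics."""
    name: str
    condition: Callable[[Metrics], bool]
    actions: List[Action]


def evaluate(policy: List[Rule], metrics: Metrics) -> List[Action]:
    """Return the actions of every rule whose condition holds, in rule order."""
    triggered: List[Action] = []
    for rule in policy:
        if rule.condition(metrics):
            triggered.extend(rule.actions)
    return triggered


policy = [
    Rule("high-load", lambda m: m["avg_load"] > 0.8, [("Add", ("resource-spec",))]),
]
busy = evaluate(policy, {"avg_load": 0.9})
calm = evaluate(policy, {"avg_load": 0.3})
```

Keeping conditions as plain functions of the metrics means a policy engine only needs the aggregated metrics to decide which reconfiguration actions to issue.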
Elastic Interactive Query with EM: Unit, Metrics
• Unit: a row of a table
• Metrics
  – Requests for data per second
  – Memory utilization
  – Idle time
  – …
Elastic Interactive Query with EM: Policy
• Rule 1 (scale out)
  Condition: avg(load) > 0.8
  Action: Add(resource-spec)
• Rule 2 (scale in)
  Condition: idle-time > 10 min
  Action: Delete(top(idle-time))
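The two rules above can be written out as a small planning function. This is a sketch under assumptions: the slide does not say how load is measured or whether the idle-time condition is per worker, so here load is a number between 0 and 1, idle time is in seconds, and Rule 2 fires when the most idle worker has been idle for more than 10 minutes. The function name `plan` and the tuple encoding of actions are hypothetical.

```python
from statistics import mean

LOAD_LIMIT = 0.8            # threshold from Rule 1
IDLE_LIMIT_SEC = 10 * 60    # 10 minutes, from Rule 2


def plan(workers):
    """Return reconfiguration actions for a non-empty mapping
    {worker id -> {"load": float, "idle_time": seconds}}."""
    actions = []

    # Rule 1 (scale out): average load across workers exceeds the limit.
    if mean(w["load"] for w in workers.values()) > LOAD_LIMIT:
        actions.append(("Add", "resource-spec"))

    # Rule 2 (scale in): delete the worker with the largest idle time,
    # assuming the condition applies to that worker's idle time.
    idlest = max(workers, key=lambda wid: workers[wid]["idle_time"])
    if workers[idlest]["idle_time"] > IDLE_LIMIT_SEC:
        actions.append(("Delete", idlest))

    return actions


overloaded = plan({
    "w0": {"load": 0.90, "idle_time": 0},
    "w1": {"load": 0.95, "idle_time": 5},
})
mostly_idle = plan({
    "w0": {"load": 0.1, "idle_time": 900},
    "w1": {"load": 0.2, "idle_time": 30},
})
```

In the first call only the scale-out rule fires; in the second only the scale-in rule fires and selects `w0`, the worker that has been idle longest.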
Distributed Machine Learning
• Start by loading data from disk into memory; access the data in memory throughout the job execution
• Iterate
  – Each worker runs the algorithm independently on its partition of the data
  – The master aggregates the computation results and calculates a model
  – This model is broadcast to the workers
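The iterate step above (workers compute on their partitions, the master aggregates and broadcasts the model) can be illustrated with a toy example. The algorithm chosen here, gradient descent for a one-parameter model y ≈ w·x, is an assumption made only to keep the sketch runnable; the slide does not name a specific algorithm, and the function names are hypothetical. Workers are simulated sequentially in one process.

```python
def local_gradient(partition, w):
    """Worker step: squared-error gradient sum and sample count for one partition."""
    grad_sum = sum(2 * (w * x - y) * x for x, y in partition)
    return grad_sum, len(partition)


def train(partitions, iterations=200, lr=0.01):
    """Master loop: broadcast the model, gather partial results, update the model."""
    w = 0.0  # the model, held by the master
    for _ in range(iterations):
        # "Broadcast" w; each worker computes on its own in-memory partition.
        results = [local_gradient(partition, w) for partition in partitions]
        # The master aggregates the partial results into one average gradient.
        total_grad = sum(g for g, _ in results)
        total_count = sum(n for _, n in results)
        w -= lr * total_grad / total_count
    return w


# Two partitions of data generated from y = 3x, loaded once and kept in memory.
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(partitions)
```

Each iteration touches every partition, which is why keeping partitions in memory matters, and the per-iteration gather and broadcast is the communication cost that the scale-in argument refers to.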
Elastic Machine Learning with EM: Unit, Metrics
• Unit: an independent observation (e.g., a single number, a vector, a matrix)
• Metrics
  – Task time per iteration
  – Computation time per iteration
  – Communication time per iteration
  – …
Elastic Machine Learning with EM: Policy
• Rule 1 (straggler handling)
  Condition: is_straggler(task-iter-time)
  Action: Migrate(top1(task-iter-time), bottom1(task-iter-time))
• Rule 2 (scale-out)
  Condition: avg(task-comp-time/task-comm-time) > TH1
  Action: Split(top1(task-comp-time/task-comm-time), 2)
• Rule 3 (scale-in)
  Condition: avg(task-comp-time/task-comm-time) < TH2
  Action: Merge(bottom2(task-comp-time/task-comm-time), 2)
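The three rules above can be sketched as one planning function over per-task metrics. Several details are assumptions because the slide leaves them open: `is_straggler` is taken to mean "iteration time above 1.5 times the median", Migrate(a, b) is read as moving state from task a to task b, and TH1 and TH2 are passed in as parameters with no default values. The names `plan`, `top`, and `bottom` are hypothetical.

```python
from statistics import mean, median


def top(values, n):
    """Keys of the n largest entries of a {task id -> value} mapping."""
    return sorted(values, key=values.get, reverse=True)[:n]


def bottom(values, n):
    """Keys of the n smallest entries of a {task id -> value} mapping."""
    return sorted(values, key=values.get)[:n]


def plan(iter_time, comp_time, comm_time, th1, th2, straggler_factor=1.5):
    """Return actions given per-task iteration, computation, and communication times."""
    actions = []

    # Rule 1 (straggler handling): assumed test is "slower than factor * median".
    slowest = top(iter_time, 1)[0]
    if iter_time[slowest] > straggler_factor * median(iter_time.values()):
        actions.append(("Migrate", slowest, bottom(iter_time, 1)[0]))

    # Rules 2 and 3 compare the average computation/communication ratio to thresholds.
    ratio = {task: comp_time[task] / comm_time[task] for task in comp_time}
    avg_ratio = mean(ratio.values())
    if avg_ratio > th1:
        actions.append(("Split", top(ratio, 1)[0], 2))                # Rule 2 (scale-out)
    elif avg_ratio < th2:
        actions.append(("Merge", tuple(bottom(ratio, 2)), 2))         # Rule 3 (scale-in)

    return actions


compute_bound = plan(
    iter_time={"t0": 10, "t1": 11, "t2": 30},
    comp_time={"t0": 8, "t1": 9, "t2": 27},
    comm_time={"t0": 2, "t1": 2, "t2": 3},
    th1=5, th2=1,
)
comm_bound = plan(
    iter_time={"t0": 10, "t1": 10, "t2": 11},
    comp_time={"t0": 1, "t1": 1, "t2": 2},
    comm_time={"t0": 9, "t1": 9, "t2": 9},
    th1=5, th2=1,
)
```

In the first call task `t2` is both the straggler and the most computation-heavy task, so it is migrated from and split; in the second call the job is communication heavy, so the two tasks with the lowest ratio are merged.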
Elastic Machine Learning Framework
[Figure: software stack. Labels: ML Algorithms, ML Abstraction, Optimizer/Runtime, Elastic Memory, Elastic Comm, Apache REEF]
Apache REEF: meta-framework for big data systems (http://reef.apache.incubator.org, SIGMOD 2015)
Current Status
• Building Elastic Memory on Apache REEF
• Building a new Elastic Machine Learning Framework that runs on Elastic Memory
• Exploring SparkSQL-like engines to work with Elastic Memory
Thank you! Q & A