Effective Straggler Mitigation: Attack of the Clones Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Effective Straggler Mitigation: Attack of the Clones · 2019-12-18 · Effective Straggler Mitigation: Attack of the Clones Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion

Jun 17, 2020


Page 1

Effective Straggler Mitigation:

Attack of the Clones

Ganesh Ananthanarayanan, Ali Ghodsi,

Scott Shenker, Ion Stoica

Page 2

Interactive Data Analytics

• Common in today’s clusters, expected to grow

• Exploratory and experimental jobs

– Data analyst querying small sample (interactive)

• Low latency is crucial for interactive jobs

• Interactive jobs are small

– Facebook: 88% of jobs operate on 20GB of data and contain fewer than 50 tasks

Page 3

Stragglers in Small Jobs

• Small interactive jobs are sensitive to stragglers

– Tasks that are much slower than the rest in the job

• Straggler Mitigation:

– Blacklisting: Eliminate machines with faulty hardware (e.g., erroneous disks)

– Speculation: LATE [OSDI’08], Mantri [OSDI’10]…

• Address the non-deterministic stragglers

Page 4

Despite the mitigation techniques…

• …in production clusters

– LATE: The slowest task runs 8 times slower* than the median task in a job

– Mantri: The slowest task runs 6 times slower* than the median task in a job

• (but they work well for large jobs…)

*we compare progress-rate of tasks, i.e., input-size/duration
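
The footnote's metric can be sketched in a few lines. This is an illustration only: the function names and the five-task job below are hypothetical, and the only thing taken from the slide is the definition of progress rate as input size divided by duration.

```python
from statistics import median


def progress_rate(input_bytes: float, duration_s: float) -> float:
    """Progress rate as defined in the footnote: input size / duration."""
    return input_bytes / duration_s


def slowdown_of_slowest(tasks: list[tuple[float, float]]) -> float:
    """How many times slower the slowest task progresses than the median task."""
    rates = [progress_rate(size, dur) for size, dur in tasks]
    return median(rates) / min(rates)


# Hypothetical job: (input bytes, duration in seconds) for each task.
tasks = [(100e6, 10), (100e6, 10), (100e6, 10), (100e6, 10), (100e6, 80)]
ratio = slowdown_of_slowest(tasks)
print(f"slowest task is {ratio:.1f}x slower than the median")
```

Comparing rates rather than raw durations keeps a task with a large input from being counted as a straggler just because it has more data to read.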

Page 5

State-of-the-art Straggler Mitigation

Speculative Execution:

1. Observe: measure relative progress of tasks

2. Speculate: launch speculative copies of straggler tasks

Page 6

Why doesn’t this work for small jobs?

1. Consist of just a few tasks

– Statistically hard to predict stragglers accurately

2. Run all their tasks simultaneously

– Observing constitutes a large fraction of job’s duration

Observe & Speculate is ill-suited to address stragglers in small jobs

Page 7

Cloning Jobs

• Proactively launch clones of a job, just as it is submitted

• Pick the result from the earliest clone

– Probabilistically mitigates stragglers

• Eschews observe & speculate, causal analysis…

Is this feasible in practice?
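
The "launch clones, pick the earliest" idea can be sketched with threads standing in for clones. This is a toy under stated assumptions, not how a cluster scheduler does it: `run_clone`, `earliest_clone`, and the sleep durations are hypothetical, and a real system would kill the losing clones rather than let them finish.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_clone(clone_id: int, duration_s: float) -> int:
    """Stand-in for one clone of a job; sleeping models its running time."""
    time.sleep(duration_s)
    return clone_id


def earliest_clone(durations_s: list[float]) -> int:
    """Launch every clone at submission time and return the first to finish."""
    pool = ThreadPoolExecutor(max_workers=len(durations_s))
    futures = [pool.submit(run_clone, i, d) for i, d in enumerate(durations_s)]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower clones; here they simply finish in the background.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# Clone 1 is fast; clones 0 and 2 play the role of stragglers.
winner = earliest_clone([0.3, 0.02, 0.3])
print("result taken from clone", winner)
```

No progress observation or straggler prediction is involved: the job finishes as soon as any one clone does.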

Page 8

Heavy-tailed Distribution

80% of jobs use 3% of resources

Can clone small jobs with few extra resources

• Production clusters for data analytics
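
A statistic of the form "X% of jobs use Y% of resources" can be computed from a trace as sketched below. The job sizes are made up to illustrate a heavy tail; they are not taken from the Facebook or Bing traces, and `resource_share_of_smallest` is a hypothetical helper.

```python
def resource_share_of_smallest(job_sizes: list[float], job_fraction: float) -> float:
    """Fraction of total resources used by the smallest `job_fraction` of jobs."""
    ordered = sorted(job_sizes)
    k = int(len(ordered) * job_fraction)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical trace: eight small jobs and two very large ones (slot-hours).
sizes = [1, 1, 1, 1, 1, 1, 1, 1, 100, 160]
share = resource_share_of_smallest(sizes, 0.8)
print(f"smallest 80% of jobs use {share:.1%} of resources")
```

When the share is this small, running extra copies of the small jobs adds little to the cluster's total load.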

Page 9

Cloning for Stragglers in Small Jobs

• Interactive jobs are important and small

• Hardest for straggler mitigation techniques

– Traditional reactive approach is insufficient

• Heavy-tailed distribution → cloning is feasible

Page 10

Challenge: Avoid I/O contention

• Every clone should get its own copy of data

• Input data of jobs

– Replicated three times (typically) by file system

• Intermediate data of jobs

– Not replicated at all, to avoid overheads

Page 11

Strawman: Job-level Cloning

• Easy to implement

• Directly extends to any framework

[Diagram: the job { T1, T2 } is cloned as a whole, each clone running its own copies of T1 and T2; labels: Job, Earliest]

Page 12

Number of clones

• Storage crunch, can’t replicate more

• >> 3 clones

• Contention for input data

Page 13

Task-level Cloning

[Diagram: the job { T1, T2 } with copies of T1 and of T2 launched separately; labels: Job, Earliest, Earliest]

Page 14

3 clones are plenty!

Strawman Task-level Cloning
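
The gap between the job-level strawman and task-level cloning can be illustrated with a small probability model. This is a sketch under assumptions that the slides do not state: each task copy straggles independently with probability `p`, a job has `n` tasks, and `c` clones are launched. The values of `p` and `n` below are arbitrary.

```python
def job_level_straggle_prob(p: float, n: int, c: int) -> float:
    """Job-level cloning: the job straggles only if every clone contains
    at least one straggling task."""
    clone_straggles = 1 - (1 - p) ** n
    return clone_straggles ** c


def task_level_straggle_prob(p: float, n: int, c: int) -> float:
    """Task-level cloning: the job straggles if, for at least one task,
    all c copies of that task straggle."""
    task_straggles = p ** c
    return 1 - (1 - task_straggles) ** n


p, n = 0.05, 10  # hypothetical straggler probability and job size
for c in (1, 2, 3):
    print(c, round(job_level_straggle_prob(p, n, c), 4),
          round(task_level_straggle_prob(p, n, c), 4))
```

Under this model, picking the earliest copy per task drives the straggler probability down much faster with each added clone than picking the earliest whole job does.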

Page 15

Intermediate data reads?

• Jobs consist of DAG of tasks

– Downstream tasks read outputs of upstream tasks

[Diagram: copies of upstream tasks U1 and U2 and of downstream task D1; annotation: One copy of the intermediate output…; legend: Completed, In-progress]

Page 16

Assign Earliest Copy

Contention Cloning (CC)

Intermediate data transfer takes longer

[Diagram: copies of upstream tasks U1 and U2 and of downstream task D1; legend: Completed, In-progress]

Page 17

Assign Exclusive Copy

Contention-Avoidance Cloning (CAC)

Jobs are more vulnerable to stragglers

[Diagram: copies of upstream tasks U1 and U2 and of downstream task D1; legend: Completed, In-progress]

Page 18

CAC vs. CC

• CAC avoids contentions but increases vulnerability to stragglers

– Straggler probability in a job increases by >10%

• CC mitigates stragglers in jobs but creates contentions

– Intermediate data transfer takes ~50% longer

How to minimize contention without straggling downstream tasks?

Page 19

Delay Assignment

• Distinguish intrinsic variations in task completions from stragglers

• Small delay to get exclusive copy before contending for the available copy

– Probabilistic model of task durations and read bandwidth

– (Similar to delay scheduling [EuroSys’10])

• Delay updated automatically and periodically
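
The decision a downstream clone faces under Delay Assignment can be sketched as follows. This is a simplified illustration, not Dolly's implementation: `UpstreamCopy`, `choose_copy`, and the fixed `delay_s` are hypothetical, the delay is assumed to be measured from the moment the earliest upstream copy finished, and the probabilistic model that sets the delay is not reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpstreamCopy:
    name: str
    finished_at: Optional[float]  # None while the copy is still in progress
    taken: bool = False           # True once another downstream clone reads it


def choose_copy(copies: list[UpstreamCopy], now: float,
                delay_s: float) -> Optional[UpstreamCopy]:
    """Return the upstream copy to read from, or None to keep waiting."""
    finished = [c for c in copies
                if c.finished_at is not None and c.finished_at <= now]
    if not finished:
        return None
    free = [c for c in finished if not c.taken]
    if free:
        # An exclusive copy is available: read it without contention.
        return min(free, key=lambda c: c.finished_at)
    earliest = min(finished, key=lambda c: c.finished_at)
    if now - earliest.finished_at >= delay_s:
        # Waited long enough: contend for the copy that is already available.
        return earliest
    return None


a = UpstreamCopy("U1-a", finished_at=0.0, taken=True)
b = UpstreamCopy("U1-b", finished_at=None)
print(choose_copy([a, b], now=1.0, delay_s=2.0))  # still within the delay
print(choose_copy([a, b], now=3.0, delay_s=2.0))  # delay expired
```

A short wait behaves like CAC when the other upstream copy is only slightly behind, and falls back to CC when that copy turns out to be a straggler.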

Page 20

Dolly: Cloning Jobs

• Task-level cloning of jobs

• Delay Assignment to manage intermediate data

• Works within a resource budget

– Clone only if resources are available
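
A budget check of the kind described above might look like the sketch below. The function and its parameters are hypothetical and the accounting is deliberately simple (slots only); the 5% default mirrors the cloning budget quoted in the evaluation slide.

```python
def can_clone(job_slots: int, num_clones: int, clone_slots_in_use: int,
              total_slots: int, budget_fraction: float = 0.05) -> bool:
    """Allow cloning only if the extra copies fit within the cloning budget."""
    extra_slots = job_slots * (num_clones - 1)  # the first copy is not "extra"
    return clone_slots_in_use + extra_slots <= budget_fraction * total_slots


# Hypothetical 1000-slot cluster: the budget is 50 slots for clones.
print(can_clone(job_slots=10, num_clones=3, clone_slots_in_use=0, total_slots=1000))
print(can_clone(job_slots=10, num_clones=3, clone_slots_in_use=40, total_slots=1000))
```

A job that does not fit in the budget simply runs without clones, so cloning never displaces regular work.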

Page 21

How effective is Dolly?

• Baselines: LATE or Mantri, + blacklisting

• Cloning budget: 5%

• Workload from Facebook and Bing traces

– 1000’s of nodes, Hadoop and Dryad jobs

• Implemented in Hadoop, 150 node deployment

Page 22

Average job completion time

Jobs are 44% and 42% faster w.r.t. LATE and Mantri

Effective Mitigation: Slowest task is 1.06x slower (down from 8x)

Page 23

Delay Assignment is crucial

1.5x – 2x better

[Figure labels: (Exclusive Copy), (Earliest Copy)]

Page 24

Impact on #phases in job?

• Job DAGs can have many (> 2) phases

Growing gap w.r.t. CAC and CC

Page 25

Conclusions

• Traditional straggler mitigation techniques ill-suited for small interactive jobs

• Dolly: Proactive Cloning of jobs

– Heavy-tail → Small cloning budget (5%) suffices

– Effective Mitigation: eliminates nearly all stragglers

• Power-law + Latency-sensitivity → Cloning