Cost Aware Fault Recovery in Clouds (IM 2013)

Aug 08, 2018

assafisr
Transcript
  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    1/29

    COST AWARE FAULT RECOVERY IN CLOUDS

    Assaf Israel, Danny Raz

    Technion - Israel Institute of Technology

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    2/29

    FAULTS IN DATACENTERS

    We've come a long way in terms of server resilience

    Enterprise gra

    Component A

    Compute

    (CPU, RAM, Fans, Net)

    ~

    Storage ~

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    3/29

    FAULTS IN DATACENTERS

    Typical first year of a new 1800-server cluster @ Google:

    - thousands of hard drive failures

    ~1000 individual machine failures

    ~3 router failures (have to immediately pull traffic for an hour)

    ~5 racks go wonky (40-80 machines see 50% packet loss)

    ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get ba

    ~1 network rewiring (~5% of machines down over 2-day span)

    ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to com

    ~0.5 overheating (power down most machines in

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    4/29

    FAULTS IN DATACENTERS

    Other factors also contribute to lack of resilience

    Distribution of service disruption events

    The Datacenter as a Computer (200

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    5/29

    RECOVERY

    Most of the time we would like to recover as quickly as p

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    6/29

    RECOVERY

    Most of the time we would like to recover as quickly as

    Single host recovery may take advantage of vacant re

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    8/29

    RECOVERY

    Larger failures (Racks, Network segments, Power regions

    May require powering more machines

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    11/29

    RECOVERY COST

    Recovery Cost: Service Degradation + Backup Infrastructure

    Can be formally expressed in terms of:

    - Service degradation cost of a task when recovered at a given host

    - Infrastructure cost of a backup host

    - 0/1 decision vectors
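
One way to write the recovery cost described on this slide is sketched below. The symbols are assumptions introduced here for illustration and are not necessarily the authors' notation: $T$ is the set of failed tasks, $H$ the set of backup hosts, $d_{t,h}$ the service degradation cost of task $t$ when recovered at host $h$, $c_h$ the infrastructure cost of host $h$, and $x_{t,h}$, $y_h$ the 0/1 decision vectors (task $t$ is recovered at host $h$; host $h$ is activated).

```latex
\mathrm{RecoveryCost}(x, y)
  = \underbrace{\sum_{t \in T} \sum_{h \in H} d_{t,h}\, x_{t,h}}_{\text{service degradation}}
  + \underbrace{\sum_{h \in H} c_h\, y_h}_{\text{backup infrastructure}},
\qquad x_{t,h},\; y_h \in \{0, 1\}
```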

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    12/29

    RECOVERY COST

    Service degradation depends on:

    Task setup/initialization

    Host setup/initialization

    Network configuration (if recovered to a different network segme

    Storage mapping

    Storage migration (if recovered to a different SAN)

    Software patches

    Integrity checks

    Manual host configuration

    Recovery target location (latency/bandwidth)

    Service Degradation

    Backup Infrastructure

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    13/29

    RECOVERY COST

    Pre-planning can help reduce recovery cost

    Activating additional backup infrastructure:

    Can help lower some of the service degradation costs

    At the expense of additional maintenance costs

    Service Degradation

    Backup Infrastructure

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    14/29

    OBSERVATION

    Not all tasks are equal:

    Interactive & vital monitoring

    High-priority non-interactive

    Non-interactive user-facing

    Batch

    Housekeeping tasks

    Some are more susceptible to long downtimes than oth

    Web-sc

    W. Cirne

    Tight SLA

    Relaxed SLA

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    15/29

    GOAL

    We would like to recover expensive tasks faster

    Balance service degradation and infrastructure costs

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    16/29

    GOAL

    We would like to recover expensive tasks first

    Balance service degradation and infrastructure costs

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    17/29

    GOAL

    Formal: Minimize the total recovery cost

    Infrastructure costs

    Service degradation costs

    Under some packing constraints
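
To make the objective concrete, here is a minimal sketch that scores a candidate recovery plan as infrastructure cost plus service degradation cost and checks a simple packing constraint. Everything in it is an assumption for illustration: the function names, the single-resource capacity model, and the toy numbers do not come from the slides.

```python
from typing import Dict, Set, Tuple

Task = str
Host = str


def is_feasible(assignment: Dict[Task, Host], active: Set[Host],
                demand: Dict[Task, int], capacity: Dict[Host, int]) -> bool:
    """Packing check (assumed model): tasks go only to active hosts,
    and the summed demand on each host stays within its capacity."""
    load: Dict[Host, int] = {}
    for task, host in assignment.items():
        if host not in active:
            return False
        load[host] = load.get(host, 0) + demand[task]
    return all(load[h] <= capacity[h] for h in load)


def total_recovery_cost(assignment: Dict[Task, Host], active: Set[Host],
                        degradation: Dict[Tuple[Task, Host], float],
                        infra: Dict[Host, float]) -> float:
    """Total cost = service degradation of each placement + cost of active hosts."""
    service = sum(degradation[(t, h)] for t, h in assignment.items())
    infrastructure = sum(infra[h] for h in active)
    return service + infrastructure


# Hypothetical toy instance: two failed tasks, two backup hosts.
demand = {"web": 2, "batch": 1}
capacity = {"b1": 2, "b2": 3}
infra = {"b1": 5.0, "b2": 4.0}
degradation = {("web", "b1"): 1.0, ("web", "b2"): 4.0,
               ("batch", "b1"): 2.0, ("batch", "b2"): 3.0}

plan_a = ({"web": "b1", "batch": "b2"}, {"b1", "b2"})  # both backups active
plan_b = ({"web": "b2", "batch": "b2"}, {"b2"})        # a single backup active

cost_a = total_recovery_cost(*plan_a, degradation, infra)
cost_b = total_recovery_cost(*plan_b, degradation, infra)
```

In this toy instance the plan that activates fewer hosts is cheaper overall even though its service degradation is higher, which is the trade-off the goal asks to balance.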

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    18/29

    APPROXIMATION - OVERVIEW

    Integer Program

    LP Relaxation

    Linear Transformations

    Light Graphs

    Cycle Breaking

    Activation Rounding

    Approximation bounds

    Cost 1 Load

    6

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    20/29

    IF WE HAD MORE INFO

    If we knew which of the backup hosts are active, we could approximate the service degradation costs

    Backup

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    21/29

    MINIMUM GENERAL ASSIGNMENT PROBLEM

    Bins, Items

    Each item has a size that depends on the target bin

    Each item has a cost that depends on the target bin

    Goal: Pack all items into bins at minimum cost, under packing constraints
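
The problem statement above can be illustrated with an exhaustive solver for tiny instances. This is only a sketch for intuition, not one of the approximation algorithms discussed in the talk: it tries every item-to-bin assignment, so it is exponential in the number of items. The function name and the toy data are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple


def min_gap_brute_force(
    items: List[str],
    bins: List[str],
    size: Dict[Tuple[str, str], int],
    cost: Dict[Tuple[str, str], float],
    capacity: Dict[str, int],
) -> Tuple[Optional[Dict[str, str]], Optional[float]]:
    """Return the cheapest feasible item->bin assignment, or (None, None)."""
    best_assignment: Optional[Dict[str, str]] = None
    best_cost: Optional[float] = None
    for choice in product(bins, repeat=len(items)):
        # Item sizes depend on the bin they are placed in.
        load = {b: 0 for b in bins}
        for item, b in zip(items, choice):
            load[b] += size[(item, b)]
        if any(load[b] > capacity[b] for b in bins):
            continue  # violates a packing constraint
        total = sum(cost[(item, b)] for item, b in zip(items, choice))
        if best_cost is None or total < best_cost:
            best_cost = total
            best_assignment = dict(zip(items, choice))
    return best_assignment, best_cost


# Hypothetical toy instance: three items, two bins of capacity 3.
items = ["i1", "i2", "i3"]
bins = ["A", "B"]
capacity = {"A": 3, "B": 3}
size = {("i1", "A"): 2, ("i1", "B"): 3,
        ("i2", "A"): 2, ("i2", "B"): 1,
        ("i3", "A"): 1, ("i3", "B"): 2}
cost = {("i1", "A"): 1, ("i1", "B"): 5,
        ("i2", "A"): 2, ("i2", "B"): 4,
        ("i3", "A"): 3, ("i3", "B"): 1}

assignment, total = min_gap_brute_force(items, bins, size, cost, capacity)
```

Note that the cheapest choice for each item in isolation (i1 and i2 in bin A, i3 in bin B) overloads bin A, so the optimum has to pay more for i2.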

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    22/29

    MIN-GAP

    Has been studied extensively

    Known results:

    LP-Based 2-Approx. (Shmoys and Tardos, 1993)

    LP-Based -Approx. (Fleischer, Goemans, Mirrokni and Svir

    Local Ratio-Based 2 -Approx. (Cohen, Katzir and Raz, 2006)

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    23/29

    LOCAL SEARCH

    Iteratively find the next backup machine to activate

    Stop when there's no improvement in recovery costs

    Backup

    Active host

    Inactive host

    Base cost - All backups are inactive

    Next Ac Find the activate recover is minim (Using it

    Stop condition

    If ( < ): return last RP

    Else: Activate return +
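
The local search loop described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `service_cost` is a hypothetical oracle that returns the service degradation cost of the best assignment for a given set of active hosts (in practice a Min-GAP approximation could play that role), and the toy oracle assumes tasks can still be recovered on inactive hosts at a higher degradation cost.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, List, Set, Tuple


def local_search(
    hosts: List[str],
    infra: Dict[str, float],
    service_cost: Callable[[FrozenSet[str]], float],
) -> Tuple[Set[str], float]:
    """Greedily activate backup hosts while the total recovery cost improves."""

    def total(active: Set[str]) -> float:
        return sum(infra[h] for h in active) + service_cost(frozenset(active))

    active: Set[str] = set()
    current = total(active)  # base cost: all backups are inactive
    while True:
        candidates = [h for h in hosts if h not in active]
        if not candidates:
            break
        # Next activation: the host whose activation gives the lowest total cost.
        best_host = min(candidates, key=lambda h: total(active | {h}))
        new_cost = total(active | {best_host})
        if new_cost >= current:
            break  # stop condition: no improvement, keep the last plan
        active.add(best_host)
        current = new_cost
    return active, current


# Hypothetical toy instance: each backup host can take exactly one task.
tasks = ["vital", "batch"]
hosts = ["b1", "b2"]
capacity = {"b1": 1, "b2": 1}
infra = {"b1": 3.0, "b2": 3.0}
deg_active = {"vital": 1.0, "batch": 1.0}     # recovered on a pre-activated host
deg_inactive = {"vital": 20.0, "batch": 2.0}  # host must be set up first


def toy_service_cost(active: FrozenSet[str]) -> float:
    """Exhaustive stand-in for the assignment step (tiny instances only)."""
    best = float("inf")
    for choice in product(hosts, repeat=len(tasks)):
        if any(choice.count(h) > capacity[h] for h in hosts):
            continue
        cost = sum(deg_active[t] if h in active else deg_inactive[t]
                   for t, h in zip(tasks, choice))
        best = min(best, cost)
    return best


active, cost = local_search(hosts, infra, toy_service_cost)
```

In the toy instance the search activates one host for the tight-SLA task and stops, because activating a second host for the batch task would cost more in infrastructure than it saves in degradation.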

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    24/29

    SIMULATIONS

    Based on data from IBM Research Compute Cloud (RC

    Several hundred hosts, with a few thousand VMs

    4 host configurations, 3 VM configurations

    EC2-like SLA policies (higher availability guarantees, at higher rates)

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    25/29

    RECOVERY COST BY RACK SIZE

    [Figure: Normalized Recovery Cost by Rack size. X-axis: Rack size (#hosts/rack), ticks 1, 2, 3, 6, 10, 17, 34. Y-axis: Cost[%], 0 to 1.]

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    26/29

    RECOVERY COST BY VM SLA DISTRIBUTION

    [Figure: 2 host racks - Total & Service costs. X-axis: SLA Distribution, 0 to 0.5. Y-axis: Cost, 0 to 250000.]

    20% - Cheap to recover

    80% - Expensive to recover

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    27/29

    RECOVERY COST BY VM SLA DISTRIBUTION

    [Figure: 2 host racks - Total & Service costs. X-axis: SLA Distribution, 0 to 0.5. Y-axis: Cost, 0 to 250000. Legend: Active / Servic / Inactiv / Servic / Local / Servic]

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    28/29

    CONCLUSION

    Large scale infrastructure mandates fault tolerance tec

    Pre-planning can help reduce recovery cost

    Classifying tasks by SLAs can improve overall recovery c

    LP-Based Load/Cost Approximation with guaranteed pe

    Local Search heuristic with good practical performance

  • 8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)

    29/29

    THANK YOU!

    Questions?