
Architectural Tactics for Large Scale Systems

Jan 27, 2015


Len Bass

How to manage consistency, failure, continuous deployment, and upgrade in the cloud
Transcript
Page 1: Architectural Tactics for Large Scale Systems

NICTA Copyright 2012 From imagination to impact

Architecture Tactics for Large-Scale Systems to Manage Changes

Len Bass

Page 2: Architectural Tactics for Large Scale Systems

About NICTA

National ICT Australia

• Federal and state funded research company established in 2002
• Largest ICT research resource in Australia
• National impact is an important success metric
• ~700 staff/students working in 5 labs across major capital cities
• 7 university partners
• Providing R&D services, knowledge transfer to Australian (and global) ICT industry

NICTA technology is in over 1 billion mobile phones

Page 3: Architectural Tactics for Large Scale Systems

WICSA 2014 is in Sydney!!

The Working IEEE/IFIP Conference on Software Architecture (WICSA) is the pre-eminent software architecture conference.

April 7-11, 2014

Page 4: Architectural Tactics for Large Scale Systems

Traditional View of Large Scale Systems

Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. Under this world view, the motivating question has been: how can development costs be reduced and run-time quality improved?

[Figure: diagram with the labels Application, Cloud Environment, End users, Developers]

Page 5: Architectural Tactics for Large Scale Systems

A Broader View

Applications are not only affected by the behavior of the end users but also by the actions of operators who control the environment for a consumer’s application.

[Figure: diagram with the labels Application, Cloud Environment, Consumer, Operator, End users, Developers]

Page 6: Architectural Tactics for Large Scale Systems

My Message: Applications must respond to change caused by the environment and the operators, as well as to new processes used during development.

[Figure: diagram with the labels Application, Cloud Environment, Consumer, Operator, End users, Developers]

Page 7: Architectural Tactics for Large Scale Systems

Applications must be aware of

• Failure and its causes
• Consistency issues
• Continuous deployment practices
• Multiple simultaneous versions active

The remainder of this talk will discuss why applications should have this kind of awareness and what tactics are used to address the problems.

Page 8: Architectural Tactics for Large Scale Systems

Failure and its causes

A year in the life of a Google data center (from Jeff Dean)
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~12 router reloads (takes out DNS and external vips for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for dns
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.

Page 9: Architectural Tactics for Large Scale Systems

Consequence for cloud consumers

• Failure is pervasive.
• Cloud as a whole is reliable (99.5% availability) but any particular physical component is not.
• This means applications must be aware of the possibility of virtual machine failure.
• Applications must be constructed to be fault tolerant.

Page 10: Architectural Tactics for Large Scale Systems

Detection of fault

• Two techniques
  – Heartbeat – component sends periodic messages indicating that it is alive
  – Timeout – client of component sets a deadline after which
    • Component will be assumed to have failed.
    • Messages will be assumed to have gotten lost
• Netflix (US video streaming service) advocates fast failure.
  – Clients set short timeout.
  – Results in better response time if component failed
  – May result in “false positive” whereby component is assumed to have failed but, in reality, is still alive.
  – If client retries request, it may be executed twice.
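The heartbeat and timeout techniques can be combined in a small monitor. The following is a minimal sketch, not any particular product's implementation: the class name `HeartbeatMonitor`, the component names, and the 2-second deadline are all assumptions made for illustration, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags those past a deadline."""
    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def suspected_failed(self, now: float) -> list:
        # A component is only *assumed* failed: a slow or partitioned
        # component looks the same as a dead one (the "false positive" case).
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.record_heartbeat("service-a", now=0.0)
monitor.record_heartbeat("service-b", now=0.0)
monitor.record_heartbeat("service-a", now=1.5)   # service-b stays silent

suspects_at_3s = monitor.suspected_failed(now=3.0)
```

A shorter `timeout_s` detects real failures sooner but also raises the chance of suspecting a component that is merely slow, which is the trade-off behind the fast-failure advice.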

Page 11: Architectural Tactics for Large Scale Systems

Recovery from fault

• Redundancy of computation and data
  – Redundancy of data will be discussed in the next section on consistency
  – Redundancy of computation is typically achieved by making services stateless.
    • Can send failed messages to new instance. Need to be concerned about second execution if first message was, in fact, acted on
    • Can instantiate new copy of service if failure is caused by overloading.
• Alternative means for accomplishing service
  – Some services can be accomplished using different mechanisms. Consider one mechanism as a fallback to a primary.
  – Degraded service might be possible.
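One common way to guard against the second execution mentioned above is to attach a request ID and deduplicate on it. The sketch below assumes that approach; `StatelessWorker`, `call_with_failover`, and the in-memory `store` dictionary (standing in for an external data store shared by all instances) are hypothetical names, not part of the talk.

```python
import uuid


class StatelessWorker:
    """One instance of a stateless service; results live in a shared store."""

    def __init__(self, name, shared_results, healthy=True):
        self.name = name
        self.shared_results = shared_results  # stands in for an external data store
        self.healthy = healthy
        self.executions = 0

    def handle(self, request_id, amount):
        if not self.healthy:
            raise TimeoutError(f"{self.name} did not answer in time")
        # Deduplicate on the request ID so a retried message is not applied twice.
        if request_id not in self.shared_results:
            self.executions += 1
            self.shared_results[request_id] = amount * 2
        return self.shared_results[request_id]


def call_with_failover(instances, request_id, amount):
    """Try each instance in turn; the same request ID is reused on every attempt."""
    for instance in instances:
        try:
            return instance.handle(request_id, amount)
        except TimeoutError:
            continue
    raise RuntimeError("all instances failed")


store = {}
primary = StatelessWorker("primary", store, healthy=False)
backup = StatelessWorker("backup", store)
request_id = str(uuid.uuid4())

first = call_with_failover([primary, backup], request_id, amount=21)
retry = call_with_failover([primary, backup], request_id, amount=21)
```

Because the instances keep no state of their own, any of them can serve the retry, and the shared record of completed request IDs keeps the work from being done twice.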

Page 12: Architectural Tactics for Large Scale Systems

Undo

• After performing an operation in AWS, may want to go back to original state – i.e. undo the operation
• Not always that straightforward:
  – Attaching volume is no problem while the instance is running, detaching might be problematic
  – Creating / changing auto-scaling rules has effect on number of running instances
    • Cannot terminate additional instances, as the rule would create new ones!
  – Deleted / terminated / released resources are gone!

Page 13: Architectural Tactics for Large Scale Systems

Undo using transaction approach

[Figure: Administrator timeline with begin-transaction, do, do, do, rollback; added operations: + commit, + pseudo-delete]

Page 14: Architectural Tactics for Large Scale Systems

Approach

[Figure: Administrator (begin-transaction, do, do, do, rollback); Undo System (sense cloud resources states, at two points)]

Page 15: Architectural Tactics for Large Scale Systems

Approach

[Figure: Administrator (begin-transaction, do, do, do, rollback); Undo System (sense cloud resources states, at two points; initial state; goal state)]

Page 16: Architectural Tactics for Large Scale Systems

Approach

[Figure: Administrator (begin-transaction, do, do, do, rollback); Undo System (sense cloud resources states, at two points; initial state; goal state; plan; generate code; execute; set of actions)]
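The planning step in the figure can be illustrated with a simple state comparison. This is only a sketch under strong assumptions: resource states are flat dictionaries, the state sensed at begin-transaction is taken as the goal of a rollback, and the state sensed at rollback time is where the plan starts. The function `plan_undo`, the action names, and the resource IDs are hypothetical, and a real undo system would need a proper planner that respects ordering constraints between actions.

```python
def plan_undo(current_state, goal_state):
    """Return (action, resource_id, properties) steps that turn current_state into goal_state."""
    actions = []
    for resource_id in sorted(current_state.keys() - goal_state.keys()):
        actions.append(("delete", resource_id, None))
    for resource_id in sorted(goal_state.keys() - current_state.keys()):
        # Only possible if the resource was pseudo-deleted rather than truly removed.
        actions.append(("restore", resource_id, goal_state[resource_id]))
    for resource_id in sorted(goal_state.keys() & current_state.keys()):
        if current_state[resource_id] != goal_state[resource_id]:
            actions.append(("update", resource_id, goal_state[resource_id]))
    return actions


# State sensed at begin-transaction (the goal of a rollback).
checkpoint = {"vm-1": {"size": "small"}, "vol-1": {"attached_to": "vm-1"}}
# State sensed when rollback is requested (where the plan starts from).
sensed_now = {"vm-1": {"size": "large"}, "vm-2": {"size": "small"}}

undo_actions = plan_undo(sensed_now, checkpoint)
```

The `restore` action shows why pseudo-delete matters: a resource that was really deleted cannot be brought back by any plan.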

Page 17: Architectural Tactics for Large Scale Systems

Report fault

• Through logs.
  – Correlating logs can be difficult
  – Tracking logs to root causes can be very difficult.
• Through reporting to parent service.
  – It, in turn, may have alternative means of achieving its goals, including undo.

Page 18: Architectural Tactics for Large Scale Systems

Consistency issues

• Data is frequently replicated.
  – NoSQL databases all replicate data
• Replication takes time.
  – Means that inconsistent versions of data may exist
    • One (or more) that has been updated
    • One (or more) that has not yet received the updates.
  – Leads to phenomenon known as “eventual consistency”
  – May take ½ second to become consistent.
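The effect of replication lag can be shown with a toy model. The sketch below is an illustration only, not how any particular database works: `ReplicatedStore` and its method names are hypothetical, there is a single replica, and the 500 ms lag is chosen to match the half-second figure above. Time is passed in explicitly so the behaviour is deterministic.

```python
class ReplicatedStore:
    """A primary plus one replica that applies writes after a fixed lag."""

    def __init__(self, replication_lag_ms):
        self.replication_lag_ms = replication_lag_ms
        self.primary = {}
        self.replica = {}
        self.pending = []  # (visible_at_ms, key, value), in write order

    def write(self, key, value, now_ms):
        self.primary[key] = value
        self.pending.append((now_ms + self.replication_lag_ms, key, value))

    def _apply_due(self, now_ms):
        still_pending = []
        for visible_at, key, value in self.pending:
            if visible_at <= now_ms:
                self.replica[key] = value
            else:
                still_pending.append((visible_at, key, value))
        self.pending = still_pending

    def eventual_read(self, key, now_ms):
        """Read from the replica: may return stale data."""
        self._apply_due(now_ms)
        return self.replica.get(key)

    def consistent_read(self, key):
        """Read from the primary: always reflects the latest write."""
        return self.primary.get(key)


store = ReplicatedStore(replication_lag_ms=500)
store.write("x", "old", now_ms=0)
store.write("x", "new", now_ms=1000)

stale = store.eventual_read("x", now_ms=1100)      # replica has not caught up yet
fresh = store.consistent_read("x")
converged = store.eventual_read("x", now_ms=1500)  # lag has elapsed
```

An application that writes and then immediately reads from a replica sees the old value, so it either has to tolerate staleness or pay for a consistent read.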

Page 19: Architectural Tactics for Large Scale Systems

Characterising Eventual Consistency in Amazon SimpleDB

• The probability of reading updated data in SimpleDB in US West
  – An application reads data X (ms) after it has written data
• SimpleDB has two read operations
  – Eventual Consistent Read
  – Consistent Read
• This pattern is consistent regardless of the time of day

[Figure: two plots, Eventual Consistent Read and Consistent Read]

Page 20: Architectural Tactics for Large Scale Systems

Other types of inconsistency

• Configuration parameters
  – All instances should have same settings in terms of security, locality, etc.
• Synchronization locks. Locks shared across distributed instances may not be in a consistent state.
• One mechanism is to have consistency manager.
  – Complicated since centralized consistency manager may fail and distributed consistency managers must be coordinated.
  – ZooKeeper is an open source tool that manages consistency for distributed applications at a small cost in latency.

Page 21: Architectural Tactics for Large Scale Systems

Continuous deployment practices

• Many organizations have developers deploy after changes are tested
  – Google
  – Amazon
  – LinkedIn
  – Netflix
• Leads to following types of problems
  – Multiple simultaneous versions active
  – Errors occurring during installation

Page 22: Architectural Tactics for Large Scale Systems

Various Upgrade Strategies

• How many at once?
  – One at a time (rolling upgrade)
  – Groups at a time (staged upgrade, e.g. canaries)
  – All at once (big flip)
• What happens to old versions?
  – Replaced en masse
  – Maintained for some period for compatibility purposes
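The "how many at once" question amounts to choosing a batch size. A minimal sketch, assuming instances are identified by simple strings: the function `upgrade_batches`, the strategy names, and the instance IDs are made up for illustration, and the big flip is simplified to a single batch containing every instance.

```python
def upgrade_batches(instances, strategy, group_size=2):
    """Split instances into the batches that are upgraded together."""
    if strategy == "rolling":
        size = 1
    elif strategy == "staged":
        size = group_size
    elif strategy == "big_flip":
        size = max(len(instances), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [instances[i:i + size] for i in range(0, len(instances), size)]


fleet = ["i-1", "i-2", "i-3", "i-4", "i-5"]
rolling = upgrade_batches(fleet, "rolling")
staged = upgrade_batches(fleet, "staged", group_size=2)
big_flip = upgrade_batches(fleet, "big_flip")
```

With anything other than the big flip, old and new versions serve traffic side by side between the first and last batch, which is where the multiple-simultaneous-versions problem comes from.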

Page 23: Architectural Tactics for Large Scale Systems

Services Can be Bundled in Two Fashions

• Tightly Coupled
  – Google
  – Facebook
• Loosely Coupled
  – Amazon
  – LinkedIn

Page 24: Architectural Tactics for Large Scale Systems

Tightly Coupled Services

• Deployment unit is tier
• A tier bundles multiple services into one virtual machine

[Figure: Tier 1, Tier 2]

Page 25: Architectural Tactics for Large Scale Systems

Loosely Coupled Services

• Deep service dependency hierarchy – may be 70 deep
• Upgrading one service in this hierarchy
• Need to consider both service and its clients
• Each service is a Virtual Machine

Figure from Netflix Tech Blog

Page 26: Architectural Tactics for Large Scale Systems

Comparing Two Options

• Both options provide for horizontal scaling based on load
• Both options provide for failure recovery
  – Tightly coupled option will replace tier
  – Loosely coupled option will replace service
  – Failure recovery assumes stateless Virtual Machines
• The options differ in
  – How updates and canaries are managed (I will discuss in a moment)
  – How unwanted dependencies are avoided
    • Tightly coupled option depends on developer discipline
    • Loosely coupled option avoids unwanted dependencies through information hiding.

Page 27: Architectural Tactics for Large Scale Systems

Common upgrade strategy

• Require all versions to be backward compatible with previous versions
• Require changes associated with new version to be software switchable.
• Clients of a service must be version aware in order to know whether to utilize new functionality.
• Once all instances have been upgraded to the new version, send a signal to turn on changes both in the new version and in its clients.
• When using canaries, only turn on changes for a subset of services and their clients.
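A software-switchable change can be sketched as a flag that ships turned off. Everything named here is hypothetical (`PricingService`, `structured_quote_enabled`, the `client_version` argument, the response shape); the point is only the order of events: deploy the new version everywhere with old behaviour, then turn the change on.

```python
class PricingService:
    """Version 2 of a hypothetical service; the new behaviour ships switched off."""

    VERSION = 2

    def __init__(self):
        self.structured_quote_enabled = False  # the software switch

    def enable_new_features(self):
        self.structured_quote_enabled = True

    def quote(self, cents, client_version=1):
        # Old behaviour is kept for version-unaware clients (backward compatibility)
        # and for everyone until the switch is turned on.
        if self.structured_quote_enabled and client_version >= 2:
            return {"cents": cents, "currency": "AUD"}
        return cents


instances = [PricingService() for _ in range(3)]
before_switch = instances[0].quote(1999, client_version=2)

# Once every instance runs the new version, signal them all to turn on the change.
if all(instance.VERSION == 2 for instance in instances):
    for instance in instances:
        instance.enable_new_features()

after_switch_new_client = instances[0].quote(1999, client_version=2)
after_switch_old_client = instances[0].quote(1999, client_version=1)
```

For a canary, the same switch would be turned on for only some of the instances and the clients routed to them.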

Page 28: Architectural Tactics for Large Scale Systems

Current state of major internet provider

• Each service has an owner
• Every service instance is instrumented
• When a canary is deployed, service owner examines monitoring data (next slide) and uses judgment to decide when to move to production.
• Canary testing is currently based on functionality. No stress testing of canaries.

Page 29: Architectural Tactics for Large Scale Systems

Netflix Monitoring Sequence

• Client outbound (start/end)
• Network (start/end)
• Service network (inbound start/end)
• Service processing (start/end)
• Service outbound (start/end)
• Network (start/end)
• Client inbound (start/end)
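With a start and end timestamp per stage, the time spent in each stage of a request is a subtraction, and the slowest stage falls out directly. A minimal sketch: the stage keys, the data layout, and the sample numbers are assumptions for illustration and do not reflect any actual Netflix data format.

```python
STAGES = [
    "client_outbound", "network_request", "service_inbound",
    "service_processing", "service_outbound", "network_response",
    "client_inbound",
]


def stage_durations(timestamps_ms):
    """Map each stage to end - start, given {stage: (start_ms, end_ms)}."""
    return {stage: timestamps_ms[stage][1] - timestamps_ms[stage][0] for stage in STAGES}


def slowest_stage(timestamps_ms):
    durations = stage_durations(timestamps_ms)
    return max(durations, key=durations.get)


sample = {
    "client_outbound": (0, 2),
    "network_request": (2, 12),
    "service_inbound": (12, 13),
    "service_processing": (13, 58),
    "service_outbound": (58, 60),
    "network_response": (60, 71),
    "client_inbound": (71, 72),
}
durations = stage_durations(sample)
bottleneck = slowest_stage(sample)
```

Comparing these per-stage durations between a canary and the existing version is one way a service owner could tell whether a slowdown comes from the service itself or from the network around it.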

Page 30: Architectural Tactics for Large Scale Systems

General picture for version aware loosely coupled services

[Figure: a Client, a top level load balancer, and two second level load balancers in front of servers for Version A and servers for Version B]

Client
• Version aware
• Must know about new versions in order to take advantage of new functionality
• May be implicitly version aware based on, e.g. cluster
• Version unaware clients will only use old functionality and these can be served by any server since services are backward compatible.

In addition:
• Load variation may trigger elasticity rules.
• Deciding whether to load new version or old version raises other issues.
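The routing rule described above can be sketched for a single load balancer. The class `VersionAwareBalancer`, the server names, and the round-robin policy are assumptions for illustration; the sketch also assumes Version B is the newer, backward compatible version.

```python
import itertools


class VersionAwareBalancer:
    """Routes version-aware requests to matching servers, others to any server."""

    def __init__(self, servers):
        # servers: list of (name, version) pairs, e.g. ("s1", "A")
        self.servers = servers
        self._round_robin = {}

    def _next(self, pool_key, pool):
        cycle = self._round_robin.setdefault(pool_key, itertools.cycle(pool))
        return next(cycle)

    def route(self, requested_version=None):
        if requested_version is None:
            # Version-unaware client: any server will do, since B is backward compatible.
            return self._next("any", self.servers)
        pool = [s for s in self.servers if s[1] == requested_version]
        if not pool:
            raise LookupError(f"no server for version {requested_version}")
        return self._next(requested_version, pool)


balancer = VersionAwareBalancer([("s1", "A"), ("s2", "A"), ("s3", "B")])
aware = [balancer.route("B") for _ in range(2)]
unaware = [balancer.route() for _ in range(3)]
```

The server list here is fixed; once elasticity rules add or remove servers, the balancer also has to learn which version each new server runs, which is one of the additional issues noted above.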

Page 31: Architectural Tactics for Large Scale Systems

Canary Issues

• Canaries are a form of live testing. Put a new version into limited production to test its correctness.
• Issues
  – How long are new versions tested to determine correctness?
    • Period based – for some period of time
    • Load based – under some utilization assumptions
    • Result based – until some criterion is met
  – How are clients of new version chosen and how is this choice enforced?
  – How are the canaries deployed?
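The three ways of deciding how long to test a canary can be written as separate checks that are combined as needed. This is a sketch only: `CanaryStats`, the function names, and all thresholds (60 minutes, 10,000 requests, 0.1% errors) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    minutes_running: float
    requests_served: int
    errors: int


def period_based(stats, min_minutes=60):
    return stats.minutes_running >= min_minutes


def load_based(stats, min_requests=10_000):
    return stats.requests_served >= min_requests


def result_based(stats, max_error_rate=0.001):
    if stats.requests_served == 0:
        return False  # no evidence yet
    return stats.errors / stats.requests_served <= max_error_rate


def ready_to_promote(stats, criteria):
    """A canary is promoted only when every chosen criterion holds."""
    return all(check(stats) for check in criteria)


stats = CanaryStats(minutes_running=90, requests_served=20_000, errors=40)
passes_time_and_load = ready_to_promote(stats, [period_based, load_based])
passes_all = ready_to_promote(stats, [period_based, load_based, result_based])
```

In this sample the canary has run long enough and seen enough load, but its error rate (0.2%) is above the assumed limit, so it would not be promoted under the result-based criterion.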

Page 32: Architectural Tactics for Large Scale Systems

Use of canaries with tightly coupled services

• Version awareness does not need to extend to load balancers
  – Services and clients are bound into VM
  – Services and clients that are used to test new version are in single VM and have no need for version aware load balancers.

Page 33: Architectural Tactics for Large Scale Systems

More Detail on Upgrade Process

• Canaries are deployed and allowed to run for a period without turning on new features.
• This is to test backward compatibility.
• Once canaries pass this test, then the new features are turned on.

Page 34: Architectural Tactics for Large Scale Systems

Installation Motivating Scenario

• You change the operating environment for an application
  – Configuration change
  – Version change
  – Hardware change
• Result is degraded performance
• When the software stack is deep with portions from different suppliers, the result is frequently:

Page 35: Architectural Tactics for Large Scale Systems

Why is Installation Error Prone?

• Installation is complicated.
  – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each
  – Apache description of addresses and ports (one out of 16 descriptions) has following elements:
    • Choosing and specifying ports for the server to listen to
    • IPv4 and IPv6
    • Protocols
    • Virtual Hosts
  – The number of configuration options that must be set can be large
    • Hadoop has 206 options
    • HBase has 64
  – Many dependencies are not visible until execution

Page 36: Architectural Tactics for Large Scale Systems

Installation Processes

• Processes may be
  – Undocumented
  – Out of date
  – Insufficiently detailed
• Our goal is to build process model including error recovery mechanisms

Page 37: Architectural Tactics for Large Scale Systems

Our Activities

• Create up to date process models for installation processes. Information sources are
  – Process discovery from logs
  – Process formalization from existing written descriptions.
• Process descriptions can be used to
  – Make trade offs
  – Make recommendations in real time to operations staff
  – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
  – Assist in the detection of errors
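As an illustration of process discovery from logs, a common first step in process mining is to count which step directly follows which within each run. The slides do not say which technique the team uses, so this is a swapped-in example; the function `directly_follows`, the log tuple layout, and the step names are all hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count how often step b directly follows step a within the same run.

    events: iterable of (run_id, timestamp, step) tuples, in any order.
    """
    runs = defaultdict(list)
    for run_id, timestamp, step in events:
        runs[run_id].append((timestamp, step))
    counts = Counter()
    for steps in runs.values():
        ordered = [step for _, step in sorted(steps)]
        counts.update(zip(ordered, ordered[1:]))
    return counts


log = [
    ("run-1", 1, "stop service"), ("run-1", 2, "install package"),
    ("run-1", 3, "edit config"), ("run-1", 4, "start service"),
    ("run-2", 1, "stop service"), ("run-2", 2, "install package"),
    ("run-2", 3, "start service"),
]
model = directly_follows(log)
```

The resulting counts show that "edit config" is an optional step in this made-up log; a run that deviates from every transition seen so far would be a candidate for error detection.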

Page 38: Architectural Tactics for Large Scale Systems

Hard Problems

• Creating accurate process models
  – Exception handling mechanisms are not well documented
  – Noisy logs
  – Our approach
    • Top down modeling using process modeling formalism
    • Bottom up process mining from error logs
• Diagnosing errors

Page 39: Architectural Tactics for Large Scale Systems

Why is Error Diagnosis Hard?

In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it.

Diagnosis involves correlating messages from
• different distributed servers
• different portions of the software stack
and determining the root cause of the error.

The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
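The correlation step can be sketched as merging per-server logs into one timeline and looking for the earliest error tied to a request. This sketch makes two assumptions that often do not hold in practice, which is part of why diagnosis is hard: clocks are synchronized across servers, and every layer logs a shared request ID. The tuple layout, layer names, and messages are invented.

```python
import heapq


def merged_timeline(*server_logs):
    """Merge per-server logs (each sorted by timestamp) into one timeline."""
    return list(heapq.merge(*server_logs, key=lambda entry: entry[0]))


def first_error(timeline, request_id):
    """Earliest ERROR entry for a request: a candidate root cause, not a proof."""
    for timestamp, layer, req, level, message in timeline:
        if req == request_id and level == "ERROR":
            return layer, message
    return None


# Entries: (timestamp, layer, request_id, level, message); all values are made up.
app_log = [(105, "app", "r1", "ERROR", "query failed")]
hbase_log = [
    (101, "hbase", "r1", "INFO", "get row"),
    (104, "hbase", "r1", "ERROR", "region unavailable"),
]
hdfs_log = [(103, "hdfs", "r1", "ERROR", "datanode unreachable")]

timeline = merged_timeline(app_log, hbase_log, hdfs_log)
root_candidate = first_error(timeline, "r1")
```

In this example the error is observed in the application layer, but the earliest error for the request sits two layers further down the stack.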

Page 40: Architectural Tactics for Large Scale Systems

Test Bed

Our current test bed is the HBase stack

Page 41: Architectural Tactics for Large Scale Systems

Currently Performing Analysis of Configuration Errors

• Cross stack errors may take hours to diagnose
  – Log files are inconsistent
  – Error message may not give context necessary to determine root cause.

Page 42: Architectural Tactics for Large Scale Systems

Summary

• The modern cloud environment and modern development practices have introduced new problems or made old problems more important.
• Tactics exist to deal with some of these problems.
• Developing tactics for other problems is a matter of research.

Page 43: Architectural Tactics for Large Scale Systems

NICTA Team

• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Sherif Sakr
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu