AWS UK UG #8

Not everything that happens in Vegas stays in Vegas

DevOps, or “getting devs to be on call for what they ship” :-)

Netflix development

Priorities

1. Speed of innovation

2. Availability

3. Running costs

a. “It’ll cost what it ends up costing”

In practice, they found that holding to the first two ended up costing far less than expected.

Riot Games + League of Legends

Cloud == ideal for MMOs. Solves launch issues.

● Chef gets used a lot here.

○ They talked about their evolution with it and the lessons learned

● What sucked?

○ 25 minute bootstrap runs

○ External dependencies (including S3)

○ Duplicating application deployment recipes

● Golden masters and immutable servers simplify your life drastically.

● “If you’re doing Chef without Berkshelf, you’re doing it wrong”

● Make it easy to throw up new things

http://berkshelf.com

Testing in production

Netflix, Riot, Kickstarter - they all do this.

At scale.

Netflix

● 10s to 100s of code pushes per day

● 1000s to 100,000s of config changes per day

○ They tune their A/B testing constantly

Of course, they also have the instrumentation to react to this.

How’re other people doing DevOps?

Good news - we’re at the “more sophisticated” end of the spectrum.

Every “cloud native” was doing this.

Things other people did better:

● “Golden master” AMIs

● Immutable instances

● Absolute ownership of vertical slices

● Config management (Chef/Puppet) featured prominently

● Extensive monitoring+logs+visibility == “table stakes”

○ for developers!

● Easy to throw up new things

● Run many small, simple, collaborating things

Who? Riot Games, Netflix, change.org, Kickstarter

Logging aggregation is important


Lots of third-party companies offer centralized logging services; there's a huge appetite for logging and monitoring.

● http://logentries.com/

● http://www.loggly.com/

● http://papertrailapp.com/

● https://www.splunkstorm.com/tour

● http://www.datadoghq.com/

● DIY - Lumberjacking slides

http://www.slideshare.net/AmazonWebServices/lumberjacking-on-aws-cutting-through-logs-to-find-what-matters-arc306-aws-reinvent-2013

DEMO: Monitoring & Logging

https://app.datadoghq.com/infrastructure

● Tag Metrics, awesome Metric discoverability

● CloudWatch integration

○ I never knew I could see ELB metrics :-)

● Alarms are integrated

● You can template Dashboards

https://papertrailapp.com/

● Can search, save searches, and alert on searches

● No alert on patterns

● Archive to S3 / Push to Redshift

Logging aggregation is FOR DEVELOPERS!!!

Saves lots of time when you’re on call.



http://www.youtube.com/watch?v=tHrT6kQR7vw




Loggly Session

Benefit of logging as a service.

● When your infrastructure is in trouble, you do not want your logging analytics system on the same infrastructure.

AWS services that Loggly could use:

● Kafka + Storm vs Kinesis

● Elasticsearch vs CloudSearch

Predictive Analytics using Storm, Hadoop, R and AWS

http://www.youtube.com/watch?v=6Sl3eBmDheE



Loggly Session

● Provisioned IOPS solve all issues :)

● ELBs do not perform well with an extremely high volume of requests.

● DNS round robin is a very good basic load balancing solution

● Cassandra works very well for application data.

● Cassandra does not work well as a queue system; it is hard to track the order of events.

● Keep the architecture simple.
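The DNS round robin point can be illustrated with a minimal client-side sketch. It assumes a hostname that publishes several A records and simply rotates through the returned addresses, one per request. The backend addresses below are hypothetical, and nothing here is taken from Loggly's actual setup.

```python
import itertools
import socket


def resolve_all(hostname, port=80):
    """Return the distinct IP addresses the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    seen = []
    for info in infos:
        # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        ip = info[4][0]
        if ip not in seen:
            seen.append(ip)
    return seen


def round_robin(addresses):
    """Return an iterator that hands out addresses in rotation, forever."""
    return itertools.cycle(addresses)


# Hypothetical backend addresses standing in for several A records.
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
picker = round_robin(backends)
first_five = [next(picker) for _ in range(5)]
```

In real use the list would come from something like `resolve_all("logs.example.com")`, where the hostname is a placeholder. Note that plain round robin does no health checking, which is part of why it is only a basic solution.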

Large Scale Load Testing on AWS

Many types of load

● Load testing

○ Like running a marathon: predict future load and plan in advance

● Stress testing

○ Break things to figure out the limits, then make mitigation plans

● Resilience test

○ Figure out how many parts of the architecture you can lose and still operate

● Performance test

○ How do latency and throughput change as the load increases?
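As a rough illustration of the performance-test idea, the sketch below steps up concurrency and records throughput and mean latency at each step. `fake_request` is a hypothetical stand-in for a call to the system under test, and the step sizes are arbitrary; a real test would use a proper load generator spread over many machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(work):
    """Run work() once and return how long it took, in seconds."""
    start = time.perf_counter()
    work()
    return time.perf_counter() - start


def run_step(work, concurrency, total_requests):
    """Issue total_requests calls at the given concurrency.

    Returns (throughput in requests per second, list of per-call latencies).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_call(work), range(total_requests)))
    elapsed = time.perf_counter() - started
    return total_requests / elapsed, latencies


def fake_request():
    # Hypothetical stand-in for a real request to the system under test.
    time.sleep(0.001)


results = {}
for concurrency in (1, 2, 4):
    throughput, latencies = run_step(fake_request, concurrency, 20)
    results[concurrency] = (throughput, sum(latencies) / len(latencies))
```

Plotting `results` against concurrency shows where throughput stops growing and latency starts climbing, which is the question the performance test asks.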

Phased rollout and measure

● Load Testing is necessary but not sufficient.

○ Deploy to alpha cluster.

○ The release cycle is important: phased deployment, starting with one box, then monitor and ramp up.

○ Monitor performance and behaviour; look at 99% of the traffic, not at the average.

● Netflix records 1.2 billion metrics per day

○ 5-minute SLA
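To see why the average is a poor thing to monitor, here is a small sketch comparing the mean with percentiles. It assumes "99% of the traffic" can be read as the 99th-percentile latency, and the sample data is made up for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With this data the mean is about 60 ms, which matches neither the typical request (20 ms) nor the slow tail (2000 ms). The 99th percentile exposes the tail directly.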

Gameday


We took part in the AWS Gameday

http://www.awsgameday.com/whatisgameday.html

Inspired by the 2012 Obama for America DevOps and Amazon.com ops teams

● Build an Autoscaling application

● Exchange administrative IAM credentials with the other team

● Break your opponent's systems

● Restore your system

● Lessons learned



Who is interested if we wanted to run this?

It needs a full day, ~ 6 hours.

Weekday?

Weekend?

Twitter: @petemounce
