(CMP406) Amazon ECS at Coursera: A general-purpose microservice

Apr 16, 2017

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Frank Chen, Coursera

Brennan Saeta, Coursera

October 2015

CMP406

Amazon ECS at Coursera

Powering a general-purpose near-line execution microservice, while defending against untrusted code

What to Expect from the Session

• Techniques for a unified near-line, batch, and scheduled micro-service powered by Amazon ECS

• Security vulnerabilities and countermeasures when running untrusted code in Docker with Amazon ECS

• Reasons to modify the Amazon ECS agent

Session Outline

• Introduction to Coursera

• Near-line, batch and scheduled job execution framework

  • Motivations and background

  • Amazon ECS benefits and limitations

  • Iguazú and its architecture

• Evaluating programming assignments

  • System requirements

  • Security threat model

  • Attacks and defenses

Education at Scale

15 million learners worldwide

2.5 million course completions

1,300+ courses

125+ partners

A unified execution framework

Batch Processing Enables…

Reporting

Instructor Reports

• Grade exports

• Learner demographics

• Course progress statistics

Internal Reports

• Business metrics

• Payments reconciliation

Scheduled Processing Enables…

Marketing

• Recommendation emails

• Targeted marketing / reactivation emails

Near-line Processing Enables…

Pedagogical Innovations

• Peer-review matching & analysis

• Auto-graded programming assignments

The early days…

January 2012

Bad Old Days of Batch Processing @ Coursera

Cascade

• PHP-based job runner

• Originally ran in screen sessions

• Polled APIs for new jobs

• Forced restarts on a regular basis due to unidentified memory leaks

• Fragile and unreliable

The early days…

Bad Old Days of Batch Processing @ Coursera

Saturn

• Scala scheduled batch job runner

• Powered by Quartz Scheduler library

• Better than Cascade, but…

• All jobs ran on the same JVM, causing interference

The not-so-early days?

Looking for something better…

What We Wanted

• Reliable

• Easy Development

• Easy Deployment

• High Efficiency

• Low Ops Load

• Cost Effective

What Else Did We Look At?

Home-grown Tech

• Tried, but proved to be unreliable

• Difficult to handle coordination and synchronization

• Powerful, but hard to productionize

• Needs developers with experience

• Designed for GCE first

• Not a managed service, higher Ops load

Amazon ECS to the Rescue

Amazon re:Invent 2014 – Dr. Werner Vogels introducing Amazon ECS

Screenshot from https://www.youtube.com/watch?v=LE5uBqNp2Ds by Amazon Web Services

Amazon ECS to the Rescue

• Little maintenance

• Integrated with rest of AWS

• Easy to develop for

However…

Amazon ECS is a great building block, but we still need to build tools around it for our purposes.

What We Built: Iguazú

Marissa Strniste (https://www.flickr.com/photos/mstrniste/5999464924) CC-BY-2.0

• Batch Job Scheduler for Amazon ECS

  • Immediately

  • Deferred (run once at X time)

  • Scheduled recurring (cron-like)

• Programmatically accessible internally via our standard APIs and clients

• Named for Iguazú Falls

  • World’s largest waterfall by volume

  • We hope Iguazú handles a similar volume of jobs
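
The three scheduling modes listed above (immediate, deferred, recurring) can be pictured with a small in-memory sketch. This is not Iguazú's implementation: the `JobScheduler` class, its method names, and the fixed interval standing in for a cron expression are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class _Entry:
    run_at: float                     # due time, e.g. seconds since some epoch
    seq: int                          # tie-breaker for equal due times
    job_name: str = field(compare=False)
    interval: Optional[float] = field(compare=False, default=None)


class JobScheduler:
    """Hypothetical scheduler: a min-heap of jobs ordered by due time."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def run_immediately(self, job_name, now):
        self._push(job_name, now)

    def run_once_at(self, job_name, run_at):
        self._push(job_name, run_at)

    def run_every(self, job_name, first_run, interval):
        # A positive fixed interval stands in for a cron expression here.
        self._push(job_name, first_run, interval)

    def _push(self, job_name, run_at, interval=None):
        entry = _Entry(run_at, next(self._seq), job_name, interval)
        heapq.heappush(self._heap, entry)

    def due_jobs(self, now):
        """Pop every job due at or before `now`; re-queue recurring ones."""
        due = []
        while self._heap and self._heap[0].run_at <= now:
            entry = heapq.heappop(self._heap)
            due.append(entry.job_name)
            if entry.interval is not None:
                self._push(entry.job_name, entry.run_at + entry.interval,
                           entry.interval)
        return due
```

A real scheduler would persist this state (the deck's architecture diagram includes Cassandra and SQS) instead of holding it in memory.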

Iguazú: Architecture

[Diagram: Devs, Users, Services, Iguazú Frontend, Iguazú Scheduler, Iguazú Backend, Iguazú Admin, Cassandra, SQS, ECS API, ECS Workers]

Developing Iguazú Jobs

class Job extends AbstractJob with StrictLogging {
  override val reservedCpu = 1024    // 1 CPU core
  override val reservedMemory = 1024 // 1 GB RAM

  def run(parameters: JsValue) = {
    logger.info("I am running my job!")
    expensiveComputationHere()
  }
}
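
In Amazon ECS, CPU is reserved in units where 1024 equals one core and memory is reserved in MiB, which matches the two values in the job above. As a rough sketch of how such a job might map onto an ECS task definition, the function below builds a payload using the `family` and `containerDefinitions` fields of the ECS `RegisterTaskDefinition` API. The function name, job name, and image URL are hypothetical, and the deck does not show how Iguazú actually registers jobs.

```python
def build_task_definition(job_name, image, reserved_cpu=1024, reserved_memory=1024):
    """Return a dict shaped like an ECS RegisterTaskDefinition request."""
    return {
        "family": job_name,
        "containerDefinitions": [
            {
                "name": job_name,
                "image": image,
                "cpu": reserved_cpu,        # CPU units; 1024 = 1 core
                "memory": reserved_memory,  # hard memory limit in MiB
                "essential": True,
            }
        ],
    }


# Hypothetical job name and registry URL, for illustration only.
payload = build_task_definition("exportQuizGrades",
                                "registry.example.com/iguazu-jobs:latest")
```

With boto3 installed and credentials configured, a payload like this could be passed as `boto3.client("ecs").register_task_definition(**payload)`.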

Running Jobs from Other Services

// invoking a job with one function call
// from another service via Naptime RPC/REST framework
val invocationId = IguazuJobInvocationClient
  .create(IguazuJobInvocationRequest(
    jobName = "exportQuizGrades",
    parameters = quizParams))

Iguazú: Developer / Ops User Interface

Deploying Jobs

Easy Deployment

1. Developers merge into master. Done!

Jenkins Build Steps:

1. Builds zip package from master

2. Prepares Docker image with zip file

3. Pushes image into Docker registry

4. Registers updated jobs with Amazon ECS API

Logs

• Logs are in /var/lib/docker/containers/*

• Uploaded to a log analysis service (Sumo Logic)

• Wrapper prints out job name and job ID at the start for easy searching

• Good enough for now
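
The wrapper idea above, printing the job name and job ID before the job's own output, might look something like the following sketch. The function name and the `key=value` line format are assumptions, and the closing status lines are an addition that the slide does not mention.

```python
import sys


def run_with_log_header(job_name, job_id, job_fn, stream=None):
    """Print a searchable header line, run the job, then log how it ended."""
    if stream is None:
        stream = sys.stdout
    prefix = f"job_name={job_name} job_id={job_id}"
    # Header first, so a log search for the job ID finds the start of the run.
    print(f"{prefix} status=starting", file=stream, flush=True)
    try:
        result = job_fn()
    except Exception as exc:
        print(f"{prefix} status=failed error={exc!r}", file=stream, flush=True)
        raise
    print(f"{prefix} status=finished", file=stream, flush=True)
    return result
```

Because the header is a single line with fixed keys, a query on `job_id=<value>` in the log analysis service should land on the first line of that run.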

Metrics

• Using third-party metrics collector (Datadog)

• Metrics for both jobs and container instances

• So long as the worker machines can talk to the Internet, things will work out pretty well

Since April 2015…

65 jobs in production

>1000 runs per day

44 different scheduled jobs

Evaluating Programming Assignments

Programming Assignments at Coursera

The Security Challenge

Compiling and running untrusted, arbitrary code in Amazon EC2

Would you like to compile and run C code from random people on the Internet on your servers?

1st Generation System

[Diagram: Learners, Coursera Servers, Queue Service, and the graders below]

• Class graders in separate AWS account

• Custom grader systems on cloud providers

• Course grader under the instructor’s desk

1st Generation System: Weaknesses

• No Auto Scaling

• No standard security

• Graders crashed

Design Goals

• Cost Savings

• No Maintenance

• Near Real-time

• Secure Infrastructure

Threat Model

Prevent submitted code from:

• impacting the evaluation of other submissions

• disrupting the grading environment (e.g., DoS)

• affecting the rest of the Coursera learning platform

Additional goals:

• Minimize exfiltration of information

  • Test cases, solutions, etc.

• Minimize risk of submissions changing their own scores

• Avoid grading machines being turned into Bitcoin miners or part of a botnet

Threat Model - Assumptions

• Run arbitrary binaries

• Instructor grading scripts may have vulnerabilities

  • ∴ Grading code is untrusted

• Unknown vulnerabilities in Docker and Linux namespacing and/or container implementation

Attack / Vulnerability Classes

Divided into 2 main categories:

• Assuming basic containers are secure, prevent any negative impact from running arbitrary code.

• Assuming basic container technology is vulnerable, mitigate negative impacts as much as possible.

What We Built: GrID

Patrick Hoesly (https://www.flickr.com/photos/zooboing/5665221326/) CC-BY-2.0

• Service + architecture for grading programming assignments

• Builds on Amazon ECS and Iguazú

• Named for Tron’s “digital frontier”

• Backronym: Grading Inside Docker

High-level GrID Architecture

[Diagram: Learners, GrID, Iguazú, S3 Bucket, ECS APIs, Grading Machines, and VPC Firewalls, split between the Coursera Production Account and the Coursera GrID Grading Account]

Attacks: Resource Exhaustion

Defenses:

• Docker / CGroups:

• CPU quotas

• Memory limits

• Swap limits

• Hard timeouts for container execution

• btrfs limits

• file system storage quotas

• IOPS throttling
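
As a minimal sketch of the cgroup-backed limits and the hard timeout listed above, the helpers below build `docker run` flags and run a command under a wall-clock limit. Coursera applied its limits through Amazon ECS and a modified agent, so the Docker CLI form, the function names, and the default values here are assumptions. The btrfs quotas and IOPS throttling are not covered.

```python
import subprocess


def resource_limit_flags(cpu_quota_us=50000, cpu_period_us=100000,
                         memory="1g", memory_swap="1g"):
    """Build `docker run` flags for CPU, memory, and swap limits."""
    return [
        f"--cpu-period={cpu_period_us}",
        f"--cpu-quota={cpu_quota_us}",   # quota / period = share of one CPU
        f"--memory={memory}",
        f"--memory-swap={memory_swap}",  # memory + swap; equal to --memory means no swap
    ]


def run_with_hard_timeout(argv, timeout_seconds):
    """Run argv; return True if it finished in time, False if it was killed."""
    try:
        subprocess.run(argv, timeout=timeout_seconds, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child it started before raising.
        return False
```

When `argv` is a `docker run` command, the timeout only kills the Docker client process, so a real setup would probably also need to stop the container itself.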

Attacks: Kernel Resource Exhaustion

Defenses:

• Open file limits per container (nofile)

• Process limits per container (nproc)

• Limit kernel memory per cgroup

• Limit execution time
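
Limits such as nofile are ordinary Unix rlimits, and Docker's `--ulimit` flag sets them on a container's processes. The sketch below shows the same mechanism on a plain child process using Python's `resource` module on a Unix system. It is an illustration and not how the deck's system applies them, and the function names and default values are assumptions. nproc and kernel memory limits are left out.

```python
import resource
import subprocess


def _clamp_and_set(which, value):
    """Lower both soft and hard limit to `value`, never above the current hard limit."""
    _soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(which, (value, value))


def run_limited(argv, max_open_files=64, max_cpu_seconds=10):
    """Run argv with open-file and CPU-time limits applied to the child only."""
    def apply_limits():
        # Runs in the child between fork and exec, so the parent is unaffected.
        _clamp_and_set(resource.RLIMIT_NOFILE, max_open_files)
        _clamp_and_set(resource.RLIMIT_CPU, max_cpu_seconds)

    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)
```

The CPU-time limit counts CPU seconds, not wall-clock time, so it complements a hard wall-clock timeout and does not replace one.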

Attacks: Network attacks

Attacks:

• Bitcoin mining

• DoS attacks on third-party systems

• Access Amazon S3 and other AWS APIs

Defense:

• Deny network access

Modifying the ECS Agent: Network Modes

• NetworkDisabled too restrictive

  • Some graders require local loopback

  • Feature also deprecated

• --net=none + deny net_admin + audit network

  • Isolation via Docker creating an independent network stack for each container

• github.com/coursera/amazon-ecs-agent

Attacks: Namespace / Container Vulnerabilities

• AppArmor & Mandatory Access Control

  • Required modifying the Amazon ECS Agent

  • Allows auditing or denying access to a variety of subsystems

• Drop capabilities

  • No need for NET_BIND_SERVICE, CAP_FOWNER

• No root within container
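
Coursera applied these settings through a modified Amazon ECS agent. For a rough picture of the equivalent `docker run` options, the sketch below assembles flags for no networking, an AppArmor profile, dropped capabilities, and a non-root user. The profile name, the UID/GID, and the function name are hypothetical.

```python
def isolation_flags(apparmor_profile="grader-profile", uid=1000, gid=1000,
                    dropped_caps=("NET_ADMIN", "NET_BIND_SERVICE", "FOWNER")):
    """Build `docker run` flags for network, MAC, capability, and user isolation."""
    flags = [
        "--net=none",                                    # own network stack, loopback only
        "--security-opt", f"apparmor={apparmor_profile}",
        f"--user={uid}:{gid}",                           # no root within the container
    ]
    # Docker capability names omit the CAP_ prefix.
    flags += [f"--cap-drop={cap}" for cap in dropped_caps]
    return flags
```

Dropping everything with `--cap-drop=ALL` and adding back only what a grader needs would be a stricter variant of the same idea.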

Attacks: Root escalations within the container

• We modify instructor grader images before allowing them to be run

  • Clears setuid

  • Inserts C wrapper to drop privileges from root and redirect stdin/stdout/stderr

• Required Amazon ECS Agent modification

  • Grant root privileges

  • Map Docker socket into Docker containers to run Docker in Docker!
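
The deck does not show how setuid is cleared from grader images. One way to picture the step, assuming the image's filesystem has been unpacked into a directory, is a walk that strips the setuid and setgid bits from every regular file. The function name is hypothetical.

```python
import os
import stat


def clear_setuid_bits(root_dir):
    """Strip setuid/setgid bits from regular files under root_dir.

    Returns the list of paths that were changed.
    """
    special_bits = stat.S_ISUID | stat.S_ISGID
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: never follow symlinks out of the tree
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & special_bits:
                os.chmod(path, stat.S_IMODE(st.st_mode) & ~special_bits)
                changed.append(path)
    return changed
```

Without setuid binaries, a process that has dropped root has one less route to regain it, which is presumably why the step is paired with the privilege-dropping wrapper.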

Attacks: If all else fails…

• Utilizes VPC security measures to further restrict network access

  • No public internet access

  • Security group to restrict inbound/outbound access

  • Network flow logs for auditing

• Separate AWS account

• Run in an Auto Scaling group

  • Regularly terminate all grading EC2 instances

Other Security Measures

• Utilize AWS CloudTrail for audit logs

• Third-party security monitoring (Threat Stack)

  • No one should log in, so any TTY is an alert

• Penetration testing by third-party red team (Synack)

Technique: Co-process

• Environment has no network, but has to get submissions in and results out

• Python co-process watches Amazon ECS / Docker

• Python co-process then:

  • Mounts a shared folder containing submission

  • Reads back the grade from the shared folder after container exits

  • Monitors and cleans up
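
The co-process flow described above (stage the submission in a shared folder, wait for the container to exit, read the grade back, clean up) can be sketched as follows. The file names `submission` and `grade.txt`, the function name, and the `run_container` callback are assumptions, and the deck does not show the real co-process code.

```python
import os
import tempfile


def grade_submission(submission_bytes, run_container):
    """Stage a submission in a shared folder, run the grader, read the grade back.

    `run_container(shared_dir)` is a caller-supplied function that must block
    until the grader container exits, for example by running Docker with the
    shared folder mounted as a volume and networking disabled.
    """
    with tempfile.TemporaryDirectory() as shared_dir:
        with open(os.path.join(shared_dir, "submission"), "wb") as f:
            f.write(submission_bytes)

        run_container(shared_dir)

        grade_path = os.path.join(shared_dir, "grade.txt")
        if not os.path.exists(grade_path):
            return None  # grader exited without writing a grade
        with open(grade_path, "r", encoding="utf-8") as f:
            return f.read().strip()
    # Leaving the `with` block deletes the shared folder (the clean-up step).
```

Passing the container runner in as a callback keeps the file handling testable without Docker. Since grading code is treated as untrusted in the threat model, a real co-process would also need to validate whatever it reads back.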

Future Improvements

• Priority queues for different grading priorities

  • Re-grades vs on-demand grades

• Better instructor tooling

  • Automated “unit-testing” for new graders

  • Better simulation of production environment on instructor machines

• Support scheduling GPUs

Lessons Learned

• Run the latest kernels

  • Latest security patches

  • btrfs wedging on older kernels

  • Default Ubuntu 14.04 kernel not new enough!

• Carefully monitor disk usage

  • Docker-in-Docker can’t clean up after itself (yet).

• Reliable deploy tooling pays for itself

• Reliable deploy tooling pays for itself

Related Sessions

Also from Coursera:

• BDT404 - Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct - Friday

Containers and Amazon ECS:

• CMP302 - Amazon EC2 Container Service: Distributed Applications at Scale – Next timeslot in Venetian H

Thank you!

Questions?

Also, we are hiring!

www.coursera.org/jobs

tech.coursera.org

Brennan Saeta

github/saeta

@bsaeta

Frank Chen

github/frankchn

@frankchn

Remember to complete your evaluations!