Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Your AMI is one of the core foundations for running applications and services effectively on Amazon EC2. In this session, you'll learn how to optimize your AMI, including how you can measure and diagnose system performance and tune parameters for improved CPU and network performance. We'll cover application-specific examples from Netflix on how optimized AMIs can lead to improved performance.
Transcript
Page 1: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

CPN302

Your Linux AMI: Optimization and Performance

(Intro)

Thor Nolen, Ecosystem Solutions Architect

November 15, 2013

Page 2: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux In EC2 Is About Choice

• Agnostic

Page 3: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux In EC2 Is About Choice

• Agnostic

• Easy to deploy, configure, and update

Page 4: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Instance Type Selection

Page 5: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Instance Type Selection

• Choose but be flexible

Page 6: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Instance Type Selection

• Choose but be flexible

• Be careful running at the edge of what your instance type can handle

Page 7: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Instance Type Consideration

• Linux AMI choice is not just distribution and version

Page 8: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Instance Type Consideration

• Linux AMI choice is no longer just distribution and version

• PV or HVM

Page 9: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Virtualization Type

PV

• Operating System is “aware” of its virtual environment

• Requires OS modifications

Page 10: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Virtualization Type

PV

• Operating System is “aware” of its virtual environment

• Requires OS modifications

HVM

• Leverages processor capabilities to deliver full virtualization

• Can use an unmodified Operating System

Page 11: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Virtualization Type

PV

• Operating System is “aware” of its virtual environment

• Requires OS modifications

HVM

• Leverages processor capabilities to deliver full virtualization

• Can use an unmodified Operating System – but PV network and storage drivers are recommended

Page 12: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Enhanced Networking

• i2 and C3

Page 13: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Enhanced Networking

• i2 and C3

• HVM only

Page 14: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Enhanced Networking

• i2 and C3

• HVM only

• Requires download, compile, and install of drivers
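
A quick way to confirm the drivers are in place once installed (a sketch; assumes an HVM instance, the AWS CLI, and the Intel ixgbevf SR-IOV driver used by enhanced networking):

  # ethtool -i eth0      (driver should report ixgbevf)
  # modinfo ixgbevf      (confirms the module is built for the running kernel)
  # aws ec2 describe-instance-attribute --instance-id <instance-id> --attribute sriovNetSupport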

Page 15: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

PV or HVM?

• There are performance differences

Page 16: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

PV or HVM?

• There are performance differences

• Determine your metrics, test, and measure

Page 17: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

PV or HVM?

• There are performance differences

• Determine your metrics, test, and measure

• Application / workload testing will guide which variant is best for you

Page 18: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux Partner Ecosystem in EC2

Page 19: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

CPN302

Page 20: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Your Linux AMI: Optimization and Performance

Coburn Watson

Manager Cloud Performance, Netflix, Inc.

November 15, 2013

Page 21: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Netflix, Inc.

• World's leading internet television network

• ~ 40 Million subscribers in 40+ countries

• Over a billion hours streamed per month

• Approximately 33% of all US Internet traffic at night

• Recent Notables
  – Increased originals catalog

Page 22: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

AMI Performance @ Netflix

Page 23: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Why Tune the AMI?

• @ Netflix: 10’s of 1000’s of instances running globally – “Rising Tide Lifts All Ships”

• Large variability in production workloads:
  – OLTP (majority of REST-based services)
  – Batch/Pre-Compute (think movie recommendations…)
  – Cassandra
  – EVCache (memcached tier)

• Cloud environments have inherent performance variability – improve resilience to such variability

• Deployment model affords ease of customization

Page 24: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Baking Performance Into the Base

• Aminator – open source AMI bakery

• Broad propagation of standard performance tunings – Apache, Tomcat configurations

• Focused application of workload-specific configurations – primarily kernel and OS optimizations: CPU scheduling, memory management, network, IO

Page 25: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux Kernel Tuning - Benefits

• Effectively drive key instance resource dimensions

• Improved efficiency at scale saves big $

• Tuning process drives identification of the ideal instance type

• Readily available advanced Linux tools (e.g., perf, systemtap) provide deep insight into the kernel and the application:
  – Top-down analysis: review of application interaction with system resources
  – Bottom-up analysis: system resource usage of the application

Page 26: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Kernel Tuning Trade-Offs

• Kernel subsystems are inter-dependent
  – Tuning in one area may improve efficiency at the expense of another

• 80/20 rule:
  – 80%: improvement gained by application refactoring and tuning
  – 20%: OS tuning, infrastructure improvement, etc.

• Tuning tailors the system for a specific workload
  – Other workloads may perform worse

The tuning objective is to align system resources to application requirements in order to improve overall system performance.

Page 27: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux Performance Tools

Page 28: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Metrics of Interest

• Performance analysis focus:

Resource       Characteristics
CPU            utilization, saturation, process priority, affinity, NUMA
Memory         physical/virtual memory usage, swapping, page cache
Network IO     network stack congestion, latency, throughput
Block IO       block layer and device latency, throughput, file system
Scalability    concurrency, parallelism, shared resources, lock contention

Page 29: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Basic Tools

Tool              Description
vmstat, dstat     Report system-wide CPU utilization, saturation, memory and swap usage.
                  Overview of kernel events: syscalls, context switches, interrupts, etc.
mpstat            Reports per-CPU utilization, hard/soft interrupts, virtualization
                  overhead (%steal, %guest)
top, atop,        Report per-process/thread state, scheduling priorities, CPU usage, etc.
htop, nmon        atop is similar to top but keeps historical data for trend analysis;
                  htop and nmon provide similar stats with a graphical view
iostat            IO latency/throughput at the driver and block layer; device utilization
sar               Keeps historical data about CPU, memory, network, and IO usage
uptime            Reports CPU saturation – threads waiting for CPU
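
For illustration, a first-pass triage might chain a few of these together (standard invocations; the one-second interval and five-sample count are arbitrary choices):

  # vmstat 1 5          (system-wide CPU, memory, swap, context switches)
  # mpstat -P ALL 1 5   (per-CPU utilization, %steal, %soft)
  # iostat -xz 1 5      (per-device IO latency, throughput, utilization)
  # sar -n DEV 1 5      (per-NIC throughput)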

Page 30: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Basic Tools, cont.

Tool                    Description
free                    Free memory and swap; counts page cache memory as free
/proc/meminfo           Memory, swap, and file system statistics; kernel memory usage,
                        statistics for the conservative memory allocation policy, HugeTLB, etc.
pidstat                 Per-process/thread CPU usage, context switches, memory, swap, IO usage
ps, pstree              Per-process/thread CPU and memory usage
/proc, /sys             /proc: stats about processes, threads, scheduling, kernel stacks, memory, etc.
                        /sys: device-specific stats: disk, NIC, etc.
netstat, iptraf         TCP/IP statistics, routing, errors, network connectivity, and NIC stats;
                        iptraf shows real-time TCP/IP network traffic
nicstat, ping, ifconfig NIC stats, network connectivity, netmask, subnet, etc.

Page 31: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Advanced Tools

Tool                    Description
blktrace                Profiles the Linux block layer and reports events like merge, plug,
                        split, remapped, etc. Reports PID, block number, IO size, timestamp, etc.
slabtop                 Kernel memory usage and statistics for the various kernel caches in use
pmap                    Dumps all memory segments in the process address space: heap, stack, mmap
pstack, jstack          Dump application user-level stack traces; jstack includes Java methods,
                        tid, pid, thread states
iotop                   Per-process/thread IO statistics; reports application time spent blocked on IO
/proc/net/softnet_stat  Per-CPU backlog queue throttling (netdev_max_backlog) stats
/proc/interrupts,       Tell which CPU is processing device interrupts; softirqs provides information
/proc/softirqs          about softirq processing for the network stack and the Linux block layer
tcpdump, wireshark      Network sniffers; capture network traffic (libpcap format) for post-analysis.
                        Wireshark can be used to analyze tcpdump and ethereal traces
ethtool                 NIC low-level statistics: NIC speed, full duplex, transmit descriptors, ring buffer

Page 32: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Advanced Tools, cont.

Tool         Description
perf         Application and kernel profiling and tracing tool. Reports top kernel and
             application CPU-bound routines and stack traces. Captures hardware events
             (CPU cache, TLB misses, etc.) and software (kernel, application) static and
             dynamic events to perform low-level profiling and tracing.
systemtap    Application and kernel profiling and tracing tool. Allows inserting trace
             points into the kernel and application dynamically to capture low-level
             profiling data for performance analysis and debugging. Scripting language
             similar to C and Perl.
latencytop   Kernel blocking events due to locks, IO, condition variables; dumps kernel stacks
strace       Reports information about system calls made by the application: type of
             system call, arguments, return value, errno, and elapsed time
numastat     NUMA-related latency stats on the HVM platform
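
A few representative invocations of these tools (a sketch; the device name and PID are placeholders):

  # strace -c -p <PID>                           (summary of syscall counts and time)
  # blktrace -d /dev/xvda -o - | blkparse -i -   (live block-layer event trace)
  # slabtop -o                                   (one-shot dump of kernel slab caches)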

Page 33: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

AMI Tuning

Page 34: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: CFS scheduler tuning

• Goal:
  – Improve batch and compute-intensive processing:
    • Increase the time slice and/or process priority in order to reduce context switches
    • The longer a process runs on the CPU, the better the use of CPU caches

• Tunables:
  – Change the scheduling policy of the workload: # chrt -a -b -p 0 <PID>

    OR

  – Set CFS tunables to increase the time slice at a system-wide level:
    • sched_latency_ns: default 6ms * (1 + log2(ncpus))
      Ex: 4 CPU cores = 18ms. Set it higher.
    • sched_min_granularity_ns: default 0.75ms * (1 + log2(ncpus))
      Ex: 4 CPU cores = 2.25ms. Set it higher.
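
A sketch of applying these from a shell (the values match the appendix examples; the /proc/sys paths assume a kernel that exposes the CFS tunables):

  # chrt -a -b -p 0 <PID>     (switch all threads of <PID> to SCHED_BATCH)
  # echo 36000000 > /proc/sys/kernel/sched_latency_ns          (36ms)
  # echo 5000000 > /proc/sys/kernel/sched_min_granularity_ns   (5ms)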

Page 35: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: CFS scheduler tuning

Page 36: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: Page Cache Tuning

• Goal:
  – Increase application write throughput
  – Reduce IO flooding by writing consistently rather than in bulk

• Tunables:
  – dirty_ratio = 60
  – dirty_background_ratio = 5
  – dirty_expire_centisecs = 30000
  – swappiness = 0

• Page cache hit/miss ratio:
  – systemtap (ioblock_request, vfs_read probes)
  – The fincore command can be used to find which pages of a file are in the page cache
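
These map to vm.* sysctls; a sketch of a one-off application (persist them in /etc/sysctl.conf so they survive the AMI bake and reboot):

  # sysctl -w vm.dirty_ratio=60
  # sysctl -w vm.dirty_background_ratio=5
  # sysctl -w vm.dirty_expire_centisecs=30000
  # sysctl -w vm.swappiness=0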

Page 37: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: Linux Block Layer Tuning

• Goal:
  – Queue more data to the SSD device to achieve higher throughput
  – Better sequential read IO throughput by fetching more data
  – Distribute IO processing across multiple CPUs

• Tunables:

  /sys/block/<dev>/queue/nr_requests = 256
  /sys/block/<dev>/queue/read_ahead_kb = 256
  /sys/block/<dev>/queue/scheduler = noop
  /sys/block/<dev>/queue/rq_affinity = 2
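
Applied to a concrete device, the tunables look like this (a sketch; xvdb is a placeholder):

  # echo 256 > /sys/block/xvdb/queue/nr_requests
  # echo 256 > /sys/block/xvdb/queue/read_ahead_kb
  # echo noop > /sys/block/xvdb/queue/scheduler
  # echo 2 > /sys/block/xvdb/queue/rq_affinity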

Page 38: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: Memory Allocation Tuning

• Goal:
  – Avoid running out of memory while running a production load
  – Do not allow memory over-commit that may result in OOM

• Tunables:
  – overcommit_memory = 2
  – overcommit_ratio = 80
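
Both are vm.* sysctls; a one-off application might look like (persist in /etc/sysctl.conf for the baked AMI):

  # sysctl -w vm.overcommit_memory=2
  # sysctl -w vm.overcommit_ratio=80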

Page 39: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Use Case: Network Stack Tuning

• Goal:
  – Increase network stack throughput
  – Larger TCP receive and congestion windows
  – Scale network stack processing across multiple CPUs

• Tunables:

  tcp_slow_start_after_idle = 0
  tcp_fin_timeout = 10
  tcp_early_retrans = 1
  netdev_max_backlog = 5000
  txqueuelen = 5000
  rmem_max, wmem_max = 16777216 or higher
  tcp_wmem, tcp_rmem = 8388608, 1258291, 16777216 or higher
  rps_sock_flow_entries = 32768
  /sys/class/net/eth?/queues/rx-0/rps_flow_cnt = 32768
  /sys/class/net/eth?/queues/rx-0/rps_cpus = 0xf
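
A sketch of the sysctl side of this list (standard net.* names; the tcp_rmem/tcp_wmem triplet is copied from the slide as-is; txqueuelen is set per interface):

  # sysctl -w net.ipv4.tcp_slow_start_after_idle=0
  # sysctl -w net.ipv4.tcp_fin_timeout=10
  # sysctl -w net.ipv4.tcp_early_retrans=1
  # sysctl -w net.core.netdev_max_backlog=5000
  # sysctl -w net.core.rmem_max=16777216
  # sysctl -w net.core.wmem_max=16777216
  # sysctl -w net.ipv4.tcp_rmem="8388608 1258291 16777216"
  # sysctl -w net.ipv4.tcp_wmem="8388608 1258291 16777216"
  # ip link set eth0 txqueuelen 5000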

Page 40: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Netflix AMI Tuning Roadmap

Page 41: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Future tuning activity

• M3 class instances support both HVM and PV: easy validation of the performance gain of HVM versus PV

• Study Cassandra workload on SSD-based systems

• Tune the Linux block layer and compare the performance of different IO schedulers: noop, CFQ, deadline

• Test file system performance (XFS, EXT4, BTRFS) on various workloads running on SSD instances

• Test network performance with new TCP/IP and network stack features: TCP Early Retransmit, TCP Proportional Rate Reduction, and RFS/RPS

• Capture low-level performance metrics using perf, systemtap, and JVM profiling tools

Page 42: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

CPN302

Page 43: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Appendix:

Perf and SystemTap

Page 44: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Profiling and Tracing Benefits

• Fine-grained measurements and low-level statistics to help with difficult-to-solve performance issues

• Isolate hot spots, resource usage, and contention in the application and kernel

• Gain comprehensive insight into application and kernel behavior

Page 45: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap and Perf Benefits

• Insert trace points into the running application and kernel without adding any debug code

• Lower overhead; processing is done in kernel space, with no stopping/starting of the application

• Help build custom tools to fill observability gaps

• Analyze throughput and latency across the application and all kernel subsystems

• Unified view of user (application) and kernel events

Page 46: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap and Perf Benefits (cont.)

SystemTap and Perf can track all sorts of events at system-wide, process, and thread levels:

• Time spent in system calls and kernel functions; arguments passed, return values, errno

• Dump application and kernel stack traces at any point in the execution path

• Time spent in various process states: blocking on IO, locks, resources, waiting for CPU

• Top CPU-bound user and kernel functions

• Low-level TCP stats not possible with standard tools

• Low-level IO and network activity; page cache hit/miss rates

• Monitor page faults, memory allocation, memory leaks

• Aggregate results when a large amount of data needs to be collected and analyzed

Page 47: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Perf and SystemTap packages

• Perf:
  – apt-get update
  – apt-get install linux-tools-common
  – apt-get install linux-base
  – apt-get install linux-tools-$(uname -r)

• SystemTap: install kernel debug packages and kernel headers exactly matching your kernel version
  – Kernel debug packages: http://ddebs.ubuntu.com/pool/main/l/linux/
  – apt-get install linux-headers-$(uname -r)
  – apt-get install systemtap

Page 48: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap and Perf Events

Perf and SystemTap capture events generated from various sources:

• Hardware Events (perf only): if running on a bare-metal system, perf can access hardware events generated by the PMU (Performance Monitoring Unit). Examples: CPU cache/TLB loads, references and misses, IPC (CPU stall cycles), branches, etc.

• Software Events: events like page faults, cpu-clock, context switches, etc.

• Static Trace Events: trace points coded into the entry and exit of kernel functions. Examples: syscalls, net, sched, irq, etc.

• Dynamic Trace Events: dynamic trace points that can be inserted on-the-fly (hot patching) into application and kernel functions via the break point engine (kprobe). No kernel or application debug compilation, pauses, etc.

Page 49: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf: sub commands
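
The following slides show live terminal screenshots for each sub-command; the images are not reproduced here. Representative invocations covering the same ground (standard perf usage; PIDs and the probe target are placeholders):

  # perf top                       (top user and kernel routines, live)
  # perf top -G                    (same, with call graphs)
  # perf stat -p <PID> sleep 10    (hardware/software event counts for a process)
  # perf record -a -g sleep 10     (record system-wide samples with stack traces)
  # perf report                    (process recorded events from perf.data)
  # perf probe --add tcp_sendmsg   (add a new dynamic trace event)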

Page 50: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf top: Top User and Kernel Routines

Page 51: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf top -G

Page 52: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf stat: Hardware Events

Page 53: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf stat: Software Events

Page 54: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf stat: Net Events

Page 55: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf probe: Add a New Event

Page 56: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf record - Record Events

Page 57: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf report – Process Recorded Events

Page 58: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf record – Record Specific Events

Page 59: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf report – Dump Full Stack Traces

Page 60: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

perf programming

Page 61: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap

• SystemTap supports a scripting language similar to C and Perl and follows an event-action model:
  – Event: trace or probe point of interest
    • Example: system calls, kernel functions, profiling events, etc.
  – Action: what to do when the event of interest occurs
    • Example: print app name and PID whenever the write() syscall is invoked

• The idea behind a SystemTap script is to name an event (probe) and provide a handler that performs the action in the event context
  – A probe point is like a break point, but instead of stopping the kernel/application at the break point, SystemTap causes a branch (jump) to the probe handler routine to perform the action

• A script can have multiple probes and associated handlers. Data is accumulated in a buffer and then dumped to standard out.
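
The slide's own example, written as a one-liner from the shell (a sketch; syscall.write, execname(), and pid() are standard SystemTap names):

  # stap -e 'probe syscall.write { printf("%s (%d)\n", execname(), pid()) }'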

Page 62: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap – Runs as a Kernel Module

• When a SystemTap script is executed, it is converted into a .c file and compiled as a Linux kernel module (.ko)

• The module is loaded into the kernel, and probes are inserted by hot patching the running kernel and application

• The module is unloaded when Ctrl-C is pressed or exit() is invoked from the module

• SystemTap scripts use the file extension .stp and contain probes and handlers written in the format:

  probe event { statements }

• When run as a script, the first line should name the interpreter: #!/usr/bin/env stap

• Or run it from the command line: # stap script.stp

Page 63: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap: Events

SystemTap trace points can be placed at various locations in the kernel:

– syscall: system call entry and return
  • Example: syscall.read, syscall.read.return

– vfs: VFS function entry and return

– kernel.function: kernel function entry and return
  • Example: kernel.function("do_fork"), kernel.function("do_fork").return

– module.function: kernel module function entry and return

• Other events:
  – begin: fires at the start of the script
  – end: fires when the script exits
  – timer: fires periodically

Page 64: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

SystemTap: Functions

Commonly used functions:

• tid(): the ID of the current thread

• uid(): the ID of the current user

• cpu(): the current CPU number

• gettimeofday_s(): the number of seconds since the UNIX epoch (January 1, 1970)

• probefunc(): the name of the function being probed

• pid(): the process ID

• execname(): the executable name

• thread_indent(): provides indentation to nicely format printing of function call entry and return

• target(): the PID specified on the command line

• print_backtrace(): print the complete stack trace

• print_regs(): print CPU registers

• kernel_string(): useful to print char fields in kernel data structures

Page 65: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Appendix:

AMI TUNING

Page 66: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

CFS Scheduler Tuning

• CFS scheduler:
  – Provides a fair share of CPU resources to all running tasks
  – Tasks are assigned weights (priority) to control the time a task can run on the CPU

• Involuntary context switch: a task has consumed its time slot or is pre-empted by a higher priority task

• A task voluntarily relinquishes the CPU when it blocks on a resource: IO (disk, net), locks, etc.

• CFS supports various scheduling policies: FIFO, BATCH, IDLE, OTHER (default), RR

Page 67: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

CFS Tunable – Compute Intensive Workload

• The performance goal of a batch workload is to complete the given task in the shortest time possible. The SCHED_BATCH policy is more appropriate for batch processing workloads.

• A task running with the SCHED_BATCH policy gets a bigger time slice and thus is not involuntarily context switched as frequently; that allows compute tasks to run longer and make better use of CPU caches.

Page 68: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

CFS Tunable – Compute Intensive Workload

CFS tunables can also be set to reduce context switching activity:

• sched_latency_ns: the period in which each runnable task should run once. A larger value offers a bigger CPU slice, which may improve compute performance; interactive application performance may suffer.
  Default: 6ms * (1 + log2(ncpus)). Example: 4 CPU cores = 18ms (default). Change it to 36ms.

• sched_min_granularity_ns: threshold on the minimum amount of CPU time each task should get. A larger value helps compute workloads.
  Default: 0.75ms * (1 + log2(ncpus)). Example: 4 CPU cores = 2.25ms (default). Change it to 5ms.

Internal testing at Netflix shows a 2-5% performance improvement for compute-intensive tasks when running the workload with the SCHED_BATCH policy as compared to SCHED_OTHER.

Page 69: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Avoid OOM Killer

To overcome memory and swap shortages, the Linux kernel may kill random processes to free memory. This mechanism is called the Out-Of-Memory (OOM) Killer.

Tunable                    Discussion

Heuristic overcommit       Allows overcommitting a reasonable amount of memory as determined by
overcommit_memory=0        free memory, swap, and other heuristics. No reservation of memory and
(default)                  swap, so memory and swap may run out before the application uses all
                           of its memory. This may result in application failure due to OOM.

Always overcommit          Allows wild overcommit: any size of memory allocation (malloc) will
overcommit_memory=1        succeed. As in the heuristic case, memory and swap may run out and
                           trigger the OOM killer.

Strict overcommit          Prevents overcommit. Does not count free memory or swap when making
overcommit_memory=2        decisions about the commit limit. When the application calls
overcommit_ratio=80        malloc(1GB), the kernel reserves (deducts) 1GB from free memory and
                           swap. This guarantees that memory committed to the application will be
                           available if needed, and prevents OOM because no overcommit is allowed.

Page 70: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Avoid OOM Killer (continued)

• When strict overcommit is enforced, the total memory that can be allocated system-wide is restricted to:

  commit limit = physical memory x overcommit_ratio + swap

  where overcommit_ratio = 50% (default). Tune overcommit_ratio to 80%.

• A new program may fail to allocate memory even when the system reports plenty of free memory and swap. This is due to memory and swap already reserved on behalf of other processes.

• This feature does not affect memory used by the file system page cache; page cache memory is always counted as free.

• Use /proc/meminfo statistics to monitor memory already committed:

  CommitLimit:  total amount of memory that can be allocated system-wide
  Committed_AS: memory already committed on behalf of applications

Any attempt to allocate memory beyond CommitLimit - Committed_AS will fail when strict overcommit is used.
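
For illustration, the remaining headroom can be read directly from /proc/meminfo (real field names; the subtraction follows the slide):

  # grep -E 'CommitLimit|Committed_AS' /proc/meminfo

The difference between the two values is what can still be allocated before strict overcommit starts refusing requests.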

Page 71: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Tuning for Higher Throughput

Tunable                   Discussion

dirty_ratio               Throttles writes when dirty pages in the file system cache reach 40%.
                          For a write-intensive workload, increase it to 60-80%.

dirty_background_ratio    Wakes up pdflush when dirty pages reach 10% of total memory.
                          Reducing the value (5%) allows pdflush to wake up early, which may
                          keep dirty page growth in check.

dirty_expire_centisecs    Data can stay dirty in the page cache for 30 secs. Increase it to
                          60-300 seconds on large-memory systems to prevent heavy IO to the
                          storage due to the short deadline. The drawback is that an unexpected
                          outage may result in loss of data not yet committed.

swappiness                Controls Linux periodic swapping activity. A large value favors
                          growing the page cache by stealing application inactive pages; setting
                          the value to zero disables periodic swapping. A large value may
                          improve application write throughput; a value of zero is recommended
                          for latency-sensitive applications.

Page 72: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Linux Block Layer – IO Tuning

• sysfs (/sys) is used to set device-specific attributes (tunables): /sys/block/<dev>/queue/..

• nr_requests: limits the number of IO requests queued per device to 128. To improve IO throughput, consider doubling this value for RAID (multiple disk) devices or SSD.

• scheduler: VM instances use the Xen virtualization layer and thus have no knowledge of the underlying geometry of the disks. The noop IO scheduler is recommended, considering it is FIFO and has the least overhead.

• read_ahead: improves sequential IO performance. A larger value means fetching more data into the page cache to improve application IO throughput.

Page 73: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Block Layer: IO Affinity

• The Linux IO affinity feature distributes IO processing work across multiple CPUs

• When the application blocks on IO, the kernel records the CPU and dispatches the IO. When the IO is marked completed by the storage driver, the block layer performs IO completion processing on the same CPU that originally issued the IO.

• This feature is very helpful when dealing with high IOPS rates, such as on SSD systems, given that IO completion processing will be distributed across multiple CPUs.

Tunable                     Discussion

rq_affinity = 1 (default)   The block layer will migrate IO completion to the CPU group that
                            originally submitted the request

rq_affinity = 2             Forces IO completion onto the CPU that originally issued the IO,
                            bypassing the "group" logic. This option maximizes distribution of
                            IO completion.

Page 74: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

RPS/RFS - Network Performance and Scalability

• RPS (Receive Packet Steering) and RFS (Receive Flow Steering) can help the system scale better by distributing network stack processing across multiple CPUs

• Without this feature, network stack processing is restricted to the same CPU that serviced the NIC interrupt, which may induce latency and lower network throughput

• The NIC driver calls netif_rx() to enqueue the packet for processing. The RPS function get_rps_cpu() selects the appropriate queue that should process the request and thus distributes the work across multiple CPUs.

• RPS makes its decision by hash lookup, using a CPU bitmask to decide which CPU should process the packet

• RFS steers the processing to the CPU where the application thread that eventually consumes the data is running. It uses the hash as an index into a network flow lookup table that maps flows to CPUs. This improves CPU cache locality.

Page 75: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

RPS/RFS - Network Performance and Scalability (continued)

Tunable                            Discussion

net.core.                          Global flow table mapping flows to the desired CPU. Each table value
rps_sock_flow_entries=32768        is a CPU index that is updated during socket calls.

/sys/class/net/eth?/queues/rx-0/   Number of entries in the per-queue flow table. The number of flows is
rps_flow_cnt=32768                 determined by the number of active connections; 32768 is a good start
                                   for a moderately loaded server. For a single-queue device (as in the
                                   case of AWS instances), the two tunables should have the same value.
                                   rps_sock_flow_entries must be set in order for this to work.

/sys/class/net/eth?/queues/rx-0/   A bitmask of CPUs. Disabled when set to zero (packets are processed
rps_cpus=0xf                       on the interrupted CPU). Set to all CPUs, or to the CPUs of the same
                                   NUMA node (on a large server). Setting the value 0xf causes CPUs
                                   0,1,2,3 to do network stack processing.
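
A sketch of enabling RPS/RFS on a single-queue interface (eth0 is a placeholder; rps_cpus takes the bitmask without the 0x prefix):

  # sysctl -w net.core.rps_sock_flow_entries=32768
  # echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
  # echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus    (0xf: CPUs 0-3)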

Page 76: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Network Stack Tuning

Packet Transmit Path:

• The network stack converts the application payload written into the socket buffer into TCP segments (or UDP datagrams), calculates the best route, and then writes the packet into the NIC driver queue

• QOS is provided by inserting various queue disciplines (FIFO, RED, CBQ, ..). The queue size is set by txqueuelen.

• The NIC driver processes packets one by one, writing (DMA) to the NIC transmit descriptors. In the case of Xen, the packet is written into the Xen shared IO ring (Xen split device driver model).

Page 77: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Network Stack Tuning

Packet Receive Path:

• The device writes (DMA) the packet into kernel memory and raises an interrupt

• In the case of Xen, the packet is written into the shared IO ring and a notification is sent via the event channel

• The NIC driver interrupt handler copies the packet into the input queue (a per-CPU queue). The queue is maintained by the network stack and its size is set by netdev_max_backlog.

• Packets are processed on the same CPU that received the interrupt. If the RPS/RFS feature is enabled, network stack processing is distributed across multiple CPUs.

• The packet is eventually written to the socket buffer. The application wakes up and processes the payload.
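
One way to check whether this input queue is overflowing (the file is standard; each row is one CPU, in hex, and the second column counts packets dropped because netdev_max_backlog was exceeded):

  # cat /proc/net/softnet_stat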

Page 78: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

TCP Congestion and Receiver Advertised Window

TCP tuning requires understanding of some critical parameters:

Parameter                       Discussion

receiver window size (rwnd),    cwnd controls the number of packets a sender can send without needing
sender window size (swnd),      an acknowledgment. TCP cwnd starts with 10 segments (slow start) and
congestion window (cwnd)        increases exponentially until it reaches the receiver advertised
                                window size (rwnd). Thus TCP cwnd will continue to grow if rwnd and
                                swnd are set to a large value. However, setting rwnd and swnd too
                                large may result in packet loss due to congestion, which may cut cwnd
                                to half of rwnd or back to the TCP slow start value, resulting in
                                lower throughput.
                                The Proportional Rate Reduction (PRR) and Early Retransmit (ER)
                                features (kernel 3.2) help recover from packet losses quickly, by
                                retransmitting early and pacing out retransmissions across received
                                ACKs during TCP fast recovery.

Bandwidth-delay product (BDP)   rwnd and swnd should be set larger than the BDP; otherwise, TCP
                                throughput will be limited.
                                BDP = link bandwidth * RTT = 1000 Mbit/s * 0.001 s / 8 bits per byte ≈ 128KB

Socket buffer size              Limit the amount of data an application can send/receive to/from the
tcp_wmem, tcp_rmem,             network stack. To improve application throughput, the socket buffer
rmem_max, wmem_max              size should be set large enough to fully utilize the TCP window.

Page 79: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Network Stack Tunables: Higher Throughput

Tunable                      Value         Discussion

tcp_slow_start_after_idle    0 (disable)   Prevents the TCP slow start value (10 segments) from being used
(default: 1, enabled)                      as the new advertised window for connections sitting idle for 3
                                           seconds. Better throughput due to continued use of the receiver
                                           advertised window instead of slow start.

tcp_fin_timeout              10 sec        Limits the number of connections in TCP TIME_WAIT state to avoid
(default: 60 sec)                          running out of available ports. Recommended for sites with a high
                                           socket churn rate and server applications initiating connection
                                           close. TIME_WAIT timeout = 2 * tcp_fin_timeout.

tcp_early_retrans            1 (enable)    Allows fast retransmit to trigger after 2 duplicate ACKs (instead
(default: 0, disabled)                     of 3 or more) for the same segment. Allows the connection to
                                           recover quickly from packet loss or network congestion.
                                           http://research.google.com/pubs/pub37486.html

netdev_max_backlog           5000          Packets received by the NIC driver are queued into a per-CPU input
(default: 1000 packets)                    queue for network stack processing. Packets will be dropped if the
                                           input queue is full, causing TCP retransmits.

Page 80: Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013

Network Stack Tunables: Higher Throughput

Tunable                Value          Discussion

txqueuelen             5000           Controls the amount of data that can be queued by the network
(default: 1000)                       stack for NIC driver processing. For latency-sensitive
                                      applications, consider reducing the value (less buffering) so
                                      that TCP congestion avoidance kicks in early in case of packet
                                      loss.

rmem_max,              16777216       Maximum receive and send socket buffer size for all protocols.
wmem_max               or higher      Set the same as tcp_wmem and tcp_rmem. Sets the maximum TCP
                                      receive window size. The larger the receive buffer, the more
                                      data can be sent before requiring an acknowledgement.
                                      Caution: larger buffers may cause memory pressure.

tcp_wmem,              8388608,       Control socket receive and send buffer sizes. Triplet:
tcp_rmem               1258291,       Min: minimum socket buffer size during memory pressure
                       16777216       (default: 4096)
                       or higher      Default: socket buffer size (receive: 87380 | send: 16384)
                                      Max: maximum socket buffer size (auto-tuned)