Your AMI is one of the core foundations for running applications and services effectively on Amazon EC2. In this session, you'll learn how to optimize your AMI, including how you can measure and diagnose system performance and tune parameters for improved CPU and network performance. We'll cover application-specific examples from Netflix on how optimized AMIs can lead to improved performance.
• M3-class instances support both HVM and PV, allowing easy validation of the performance gain of HVM versus PV
• Study a Cassandra workload on SSD-based systems
• Tune the Linux block layer and compare the performance of different IO schedulers: noop, CFQ, deadline
• Test file system (XFS, EXT4, BTRFS) performance on various workloads running on SSD instances
• Test network performance with new TCP/IP and network stack features: TCP Early Retransmit, TCP Proportional Rate Reduction, and RFS/RPS
• Capture low-level performance metrics using perf, SystemTap, and JVM profiling tools
CPN302
Appendix:
Perf and SystemTap
Profiling and Tracing Benefits
• Fine-grained measurements and low-level statistics to help with difficult-to-solve performance issues
• Isolate hot spots, resource usage, and contention in the application and kernel
• Gain comprehensive insight into application and kernel behavior
SystemTap and Perf Benefits
• Insert trace points into the running application and kernel without adding any debug code
• Lower overhead: processing is done in kernel space, with no need to stop or restart the application
• Help build custom tools to fill observability gaps
• Analyze throughput and latency across the application and all kernel subsystems
• Unified view of user (application) and kernel events
SystemTap and Perf Benefits (cont.)
SystemTap and perf can track all sorts of events at the system-wide, process, and thread levels:
• Time spent in system calls and kernel functions; arguments passed, return values, errno
• Dump of the application and kernel stack trace at any point in the execution path
• Time spent in various process states: blocking on IO, locks, and other resources, or waiting for CPU
• Top CPU-bound user and kernel functions
• Low-level TCP statistics not possible with standard tools
• Low-level IO and network activity; page cache hit/miss rates
• Dynamic trace events: trace points that can be inserted on-the-fly (hot patching) into application and kernel functions via the breakpoint engine (kprobes). No kernel or application debug recompilation, no pauses, etc.
perf: Sub-commands
perf top: Top User and Kernel Routines
perf top -G
perf stat: Hardware Events
perf stat: Software Events
perf stat: Net Events
perf probe: Add a New Event
perf record: Record Events
perf report: Process Recorded Events
perf record: Record Specific Events
perf report: Dump Full Stack Traces
perf programming
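The sub-commands above can be sketched with a few representative invocations; flags beyond those shown on the slides are common perf defaults, not taken from the deck:

```shell
# Sketch: representative perf invocations for the sub-commands listed above.
perf top -G                           # live top user/kernel routines with call graphs
perf stat -e cycles,instructions ls   # hardware event counts for a command
perf record -g -a sleep 10            # record system-wide samples for 10 seconds
perf report                           # process and browse the recorded events
```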
SystemTap
• SystemTap supports a scripting language similar to C and Perl and follows an event-action model:
– Event: a trace or probe point of interest
• Examples: system calls, kernel functions, profiling events, etc.
– Action: what to do when the event of interest occurs
• Example: print the application name and PID whenever the write() syscall is invoked
• The idea behind SystemTap is to name an event (probe) and provide a handler that performs an action in the event context. A probe point is like a breakpoint, but instead of stopping the kernel/application at the breakpoint, SystemTap causes a branch (jump) to the probe handler routine to perform the action.
• A script can have multiple probes and associated handlers. Data is accumulated in a buffer and then dumped to standard out.
SystemTap – Runs as a Kernel Module
• When a SystemTap script is executed, it is converted into a .c file and compiled as a Linux kernel module (.ko)
• The module is loaded into the kernel, and probes are inserted by hot patching the running kernel and application
• The module is unloaded when <Ctrl>-<C> is pressed or exit() is invoked from the script
• SystemTap scripts use the file extension .stp and contain probes and handlers written in the format:
probe event { statements }
• When run as a script, the first line should invoke the interpreter: #!/usr/bin/env stap
• Or run it from the command line: # stap script.stp
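Putting the pieces together, the write() example from the earlier slide might look like the following minimal .stp script; the probe alias and helper functions follow the standard SystemTap tapsets:

```systemtap
# Minimal sketch: print the executable name and PID on every write() syscall.
probe syscall.write {
    printf("%s (pid %d) called write()\n", execname(), pid())
}
```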
SystemTap: Events
SystemTap trace points can be placed at various locations in the kernel:
– syscall: system call entry and return
• Example: syscall.read, syscall.read.return
– vfs: VFS function entry and return
– kernel.function: kernel function entry and return
• Example: kernel.function("do_fork"), kernel.function("do_fork").return
– module.function: kernel module function entry and return
• Other events:
– begin: fires at the start of the script
– end: fires when the script exits
– timer: fires periodically
SystemTap: Functions
Commonly used functions:
• tid(): the ID of the current thread
• uid(): the ID of the current user
• cpu(): the current CPU number
• gettimeofday_s(): the number of seconds since the UNIX epoch (January 1, 1970)
• probefunc(): the name of the function being probed
• pid(): the process ID
• execname(): the executable name
• thread_indent(): provides indentation to nicely format the printing of function call entry and return
• target(): the PID specified on the command line (via stap -x PID)
• print_backtrace(): prints the complete stack trace
• print_regs(): prints the CPU registers
• kernel_string(): useful for printing char* fields in data structures
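As a sketch combining several of these functions (the script is illustrative; run it as `stap -x <pid> open_trace.stp`, where the script name is our own):

```systemtap
# Sketch: log open() calls made by the target PID, with a timestamp,
# executable name, thread ID, and the file name being opened.
probe syscall.open {
    if (pid() == target())
        printf("%d %s[%d] open(%s)\n",
               gettimeofday_s(), execname(), tid(), filename)
}
```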
Appendix:
AMI Tuning
CFS Scheduler Tuning
• CFS scheduler:
– Provides a fair share of CPU resources to all running tasks
– Tasks are assigned weights (priorities) to control the time a task can run on the CPU
• Involuntary context switch: a task has consumed its time slice or is preempted by a higher-priority task
• Voluntary context switch: a task relinquishes the CPU when it blocks on a resource: IO (disk, net), locks, etc.
• CFS supports various scheduling policies: FIFO, BATCH, IDLE, OTHER (default), RR
CFS Tunable – Compute Intensive Workload
• The performance goal of a batch workload is to complete the given task in the shortest time possible. The SCHED_BATCH policy is more appropriate for batch-processing workloads.
• A task running with the SCHED_BATCH policy gets a bigger time slice and thus is not involuntarily context-switched as frequently, which allows compute tasks to run longer and make better use of CPU caches.
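One way to apply SCHED_BATCH to a job is via chrt; the job name below is a placeholder, not something from the deck:

```shell
# Sketch: run a compute job under SCHED_BATCH (priority must be 0 for this
# policy). ./compute_job is a placeholder for your workload.
chrt --batch 0 ./compute_job &
chrt -p $!    # verify: prints the scheduling policy and priority of the job
```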
CFS Tunables – Compute Intensive Workload (cont.)
CFS tunables can also be set to reduce context-switching activity:
• sched_latency_ns: the period in which each runnable task should run once. A larger value offers a bigger CPU slice, which may improve compute performance; interactive application performance may suffer.
Default: 6 ms * (1 + log2(ncpus)). Example: 4 CPU cores = 18 ms (default); change it to 36 ms.
• sched_min_granularity_ns: a floor on the minimum CPU time each task should get. A larger value helps compute workloads.
Default: 0.75 ms * (1 + log2(ncpus)). Example: 4 CPU cores = 2.25 ms (default); change it to 5 ms.
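The defaults above can be sanity-checked for any core count with a small helper; the function names are ours, not a kernel interface:

```shell
# Hypothetical helpers: compute the default sched_latency_ns and
# sched_min_granularity_ns (in ms) for a given CPU count, per the
# formulas 6 ms * (1 + log2(ncpus)) and 0.75 ms * (1 + log2(ncpus)).
default_latency_ms() {
    awk -v n="$1" 'BEGIN { printf "%g", 6 * (1 + log(n)/log(2)) }'
}
default_min_granularity_ms() {
    awk -v n="$1" 'BEGIN { printf "%g", 0.75 * (1 + log(n)/log(2)) }'
}
default_latency_ms 4           # 18, matching the 4-core example above
default_min_granularity_ms 4   # 2.25
```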
Internal testing at Netflix shows a 2-5% performance improvement for compute-intensive tasks when running the workload with the SCHED_BATCH policy as compared to SCHED_OTHER.
Avoid OOM Killer
To overcome memory and swap shortages, the Linux kernel may kill processes to free memory. This mechanism is called the Out-Of-Memory (OOM) Killer.
• Heuristic overcommit: overcommit_memory=0 (default). Allows overcommitting a reasonable amount of memory as determined by free memory, swap, and other heuristics. No memory or swap is reserved, so memory and swap may run out before the application uses all of its memory. This may result in application failure due to OOM.
• Always overcommit: overcommit_memory=1. Allows unlimited overcommit; a memory allocation (malloc) of any size will succeed. As in the heuristic case, memory and swap may run out and trigger the OOM killer.
• Strict overcommit: overcommit_memory=2, overcommit_ratio=80. Prevents overcommit. The kernel does not count free memory or swap when making decisions about the commit limit. When an application calls malloc(1 GB), the kernel reserves (deducts) 1 GB of the commit limit. This guarantees that memory committed to an application will be available when needed, and prevents OOM because no overcommit is allowed.
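The strict-overcommit setting above corresponds to the following sysctl commands; shown with sysctl -w for a running system, they would normally also be persisted in /etc/sysctl.conf:

```shell
# Sketch: enable strict overcommit with an 80% ratio, as in the example above.
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
```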
Avoid OOM Killer (cont.)
• When strict overcommit is enforced, the total memory that can be allocated system-wide is restricted to:
overcommit limit = physical memory × (overcommit_ratio / 100) + swap
• A new program may fail to allocate memory even when the system reports plenty of free memory and swap. This is due to memory and swap already reserved on behalf of other processes.
• This feature does not affect memory used by the file system page cache; page cache memory is always counted as free.
• Use /proc/meminfo statistics to monitor memory already committed:
CommitLimit: the total amount of memory that can be allocated system-wide
Committed_AS: memory already committed on behalf of applications
Available to commit: CommitLimit - Committed_AS
Any attempt to allocate memory beyond this available amount will fail when strict overcommit is used.
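The overcommit-limit formula can be checked with a small helper; the function name is ours, not a kernel interface:

```shell
# Hypothetical helper: compute the strict-overcommit CommitLimit (in MB)
# from physical memory (MB), overcommit_ratio (percent), and swap (MB).
commit_limit_mb() {
    awk -v mem="$1" -v ratio="$2" -v swap="$3" \
        'BEGIN { printf "%g", mem * ratio / 100 + swap }'
}
commit_limit_mb 16000 80 4000   # 16000 MB RAM, ratio 80, 4000 MB swap -> 16800
```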
Tuning for Higher Throughput
• dirty_ratio: throttles writes when dirty pages in the file system cache reach 40% of memory. For write-intensive workloads, increase it to 60-80%.
• dirty_background_ratio: wakes up pdflush when dirty pages reach 10% of total memory. Reducing the value (to 5%) lets pdflush wake up earlier, which may keep dirty-page growth in check.
• dirty_expire_centisecs: data can stay dirty in the page cache for 30 seconds. Increase it to 60-300 seconds on large-memory systems to prevent heavy IO to storage caused by the short deadline. The drawback is that an unexpected outage may result in the loss of data not yet committed.
• swappiness: controls Linux's periodic swapping activity. A large value favors growing the page cache by stealing inactive application pages, and may improve application write throughput. Setting the value to zero disables periodic swapping; zero is recommended for latency-sensitive applications.
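As a sketch, the guidance above maps to sysctl settings like these; the specific values are examples to tune per workload, not fixed recommendations:

```shell
# Sketch: example values following the discussion above.
sysctl -w vm.dirty_ratio=60                # write-intensive workloads
sysctl -w vm.dirty_background_ratio=5      # wake pdflush earlier
sysctl -w vm.dirty_expire_centisecs=6000   # 60 seconds
sysctl -w vm.swappiness=0                  # latency-sensitive applications
```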
Linux Block Layer – IO Tuning
• sysfs (/sys) is used to set device-specific attributes (tunables): /sys/block/<dev>/queue/..
• nr_requests: limits the number of IO requests queued per device (default 128). To improve IO throughput, consider doubling this value for RAID (multiple-disk) devices or SSDs.
• scheduler: VM instances run on the Xen virtualization layer and thus have no knowledge of the underlying disk geometry. The noop IO scheduler is recommended since it is FIFO and has the least overhead.
• read_ahead_kb: improves sequential IO performance. A larger value means more data is fetched into the page cache, improving application IO throughput.
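These per-device attributes are set through sysfs; xvda below is an example device name, and the values follow the guidance above:

```shell
# Sketch: per-device block-layer tuning via sysfs (xvda is an example device).
echo 256  > /sys/block/xvda/queue/nr_requests     # double the 128 default
echo noop > /sys/block/xvda/queue/scheduler       # least-overhead scheduler
echo 512  > /sys/block/xvda/queue/read_ahead_kb   # larger sequential read-ahead
```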
Block Layer: IO Affinity
• The Linux IO affinity feature distributes IO-processing work across multiple CPUs.
• When the application blocks on IO, the kernel records the CPU and dispatches the IO. When the IO is marked complete by the storage driver, the block layer performs IO completion processing on the same CPU that originally issued the IO.
• This feature is very helpful when dealing with high IOPS rates, such as on SSD systems, since IO completion processing is distributed across multiple CPUs.
• rq_affinity = 1 (default): the block layer migrates IO completions to the CPU group that originally submitted the request.
• rq_affinity = 2: forces the IO completion onto the exact CPU that originally issued the IO, bypassing the "group" logic. This option maximizes distribution of IO completions.
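The rq_affinity attribute lives in the same sysfs queue directory; xvda is again an example device name:

```shell
# Sketch: force IO completions onto the issuing CPU (xvda is an example device).
echo 2 > /sys/block/xvda/queue/rq_affinity
```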
RPS/RFS – Network Performance and Scalability
• RPS (Receive Packet Steering) and RFS (Receive Flow Steering) can help the system scale better by distributing network stack processing across multiple CPUs.
• Without this feature, network stack processing is restricted to the same CPU that serviced the NIC interrupt, which may induce latency and lower network throughput.
• The NIC driver calls netif_rx() to enqueue the packet for processing. The RPS function get_rps_cpu() selects the appropriate queue that should process the request, and thus distributes the work across multiple CPUs.
• RPS makes its decision via a hash lookup that uses a CPU bitmask to decide which CPU should process the packet.
• RFS steers the processing to the CPU where the application thread that eventually consumes the data is running. It uses the hash as an index into a network flow lookup table that maps flows to CPUs. This improves CPU cache locality.
RPS/RFS – Network Performance and Scalability (cont.)
• net.core.rps_sock_flow_entries=32768: the global flow table mapping each flow to its desired CPU. Each table value is a CPU index that is updated during socket calls.
• /sys/class/net/eth?/queues/rx-0/rps_flow_cnt=32768: the number of entries in the per-queue flow table. The appropriate value is determined by the number of active connections; 32768 is a good start for a moderately loaded server. For a single-queue device (as in the case of AWS instances), the two tunables should be set to the same value. net.core.rps_sock_flow_entries must also be set for this to take effect.
• /sys/class/net/eth?/queues/rx-0/rps_cpus=0xf: a bitmask of CPUs. Zero disables the feature (packets are processed on the interrupted CPU). Set it to all CPUs, or to the CPUs of the same NUMA node on a large server. Setting the value to 0xf causes CPUs 0, 1, 2, and 3 to do network stack processing.
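Put together, enabling RPS/RFS on a single-queue interface looks like the following; eth0 is an example interface name:

```shell
# Sketch: enable RPS/RFS on a single-queue interface (eth0 is an example).
sysctl -w net.core.rps_sock_flow_entries=32768
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus   # CPUs 0-3
```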
Network Stack Tuning
Packet Transmit Path:
• The network stack converts the application payload written into the socket buffer into TCP segments (or UDP datagrams), calculates the best route, and then writes the packet into the NIC driver queue.
• QoS is provided by inserting various queueing disciplines (FIFO, RED, CBQ, ..). The queue size is set by txqueuelen.
• The NIC driver processes packets one by one by writing (DMA) to the NIC transmit descriptors. In the case of Xen, the packet is written into the Xen shared IO ring (the Xen split device driver model).
Network Stack Tuning
Packet Receive Path:
• The device writes (DMA) the packet into kernel memory and raises an interrupt.
• In the case of Xen, the packet is written into the shared IO ring and a notification is sent via the event channel.
• The NIC driver's interrupt handler copies the packet into the input queue (a per-CPU queue). The queue is maintained by the network stack, and its size is set by netdev_max_backlog.
• Packets are processed on the same CPU that received the interrupt. If the RPS/RFS feature is enabled, network stack processing is distributed across multiple CPUs.
• The packet is eventually written to the socket buffer; the application wakes up and processes the payload.
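The two queue sizes named above can be adjusted as follows; eth0 and the values are illustrative, not recommendations from the deck:

```shell
# Sketch: raise the transmit and per-CPU input queue sizes (eth0 and the
# values are examples).
ip link set eth0 txqueuelen 5000
sysctl -w net.core.netdev_max_backlog=5000
```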
TCP Congestion and Receiver Advertised Window
TCP tuning requires understanding some critical parameters:
• receiver window size (rwnd), sender window size (swnd), congestion window (cwnd):
cwnd controls the number of packets a sender can send without needing an acknowledgment. TCP cwnd starts at 10 segments (slow start) and increases exponentially until it reaches the receiver's advertised window size (rwnd). Thus TCP cwnd will continue to grow if rwnd and swnd are set to large values. However, setting rwnd and swnd too large may result in packet loss due to congestion, which may cut cwnd to half of rwnd or back to the TCP slow-start value, resulting in lower throughput.
Proportional Rate Reduction (PRR) and Early Retransmit (ER) (kernel 3.2) help recover from packet losses quickly by retransmitting early and pacing out retransmissions across received ACKs during TCP fast recovery.
• Bandwidth-delay product (BDP): rwnd and swnd should be set larger than the BDP. Otherwise, TCP throughput will be