The Cloud Story or Less is More…
by Slava Vladyshevsky slava[at]verizon.com
Dedicated to Lee, Sarah, David, Andy and Jeff as well as many others,
who went above and beyond to make this possible.
“Cache is evil. Full stop.” Jeff
Table of Contents

PART I – BUILDING TESTBED
PART II – FIRST TEST
PART III – STORAGE STACK PERFORMANCE
PART IV – DATABASE OPTIMIZATION
PART V – PEELING THE ONION
PART VI – PFSENSE
PART VII – JMETER
PART VIII – ALMOST THERE
PART IX – CASSANDRA
PART X – HAPROXY
PART XI – TOMCAT
PART XII – JAVA
PART XIII – OS OPTIMIZATION
PART XIV – NETWORK STACK

Figure Register

AWS Application Deployment
Initial VCC Application Deployment
First Test Results - Comparison Chart
First Test - High CPU Load on DB Server
First Test - High CPU %iowait on DB Server
First Test - Disk I/O Skew on DB Server
Optimized Storage Subsystem Throughput
AWS i2.8xlarge CPU load - Sysbench Test Completed in 64.42 sec
VCC 4C-28G CPU load - Sysbench Test Completed in 283.51 sec
InnoDB Engine Internals
Optimized MySQL DB - QPS Graph
Optimized MySQL DB - TPS and RT Graph
Optimized MySQL DB - RAID Stripe I/O Metrics
Optimized MySQL DB - CPU Metrics
Optimized MySQL DB - Network Metrics
Jennifer APM Console
Initial Application Deployment - Network Diagram
Jennifer XView - Transaction Response Time Scatter Graph
Jennifer APM - Transaction Introspection
Iterative Optimization Progress Chart
Jennifer XView - Transaction Response Time Surges
VCC Cassandra Cluster CPU Usage During the Test
AWS Cassandra Cluster CPU Usage During the Test
High-Level Cassandra Architecture
Jennifer APM - Concurrent Connections and Per-server Arrival Rate
Jennifer APM - Connection Statistics After Optimization
Jennifer APM - DB Connection Pool Usage
JVM Garbage Collection Analysis
JVM Garbage Collection Analysis – Optimized Run
XEN PV Driver and Network Device Architecture
Recommended Network Optimizations
Last Performance Test Results

Table Register

Major Infrastructure Limits
AWS Infrastructure Mapping and Sizing
VCC Infrastructure Mapping and Sizing
Optimized MySQL DB - Recommended Settings
Optimized Cassandra - Recommended Settings
Network Parameter Comparison
PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in various areas, ranging from consumer electronics to mobile communications and cloud services. One of their strategic initiatives is to expand software capabilities to get on top of the competition. The Customer started to use the AWS platform for development purposes and as the main hosting platform for their cloud services. Over the past years AWS usage grew significantly, with over 30 production applications currently hosted on AWS infrastructure. While the Customer's reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and incurred unnecessarily high costs to grow application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was generally satisfactory, several major challenges and trends emerged over time:
- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in
Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would give the Customer freedom of choice, thus addressing another issue. Last, but not least, the unique VCC platform architecture and infrastructure stack built for low-latency and high-performance workloads would definitely help to address another pain point – application performance and scalability. Senior executives from both companies supported this initiative and one of the Customer's applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management. The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer's SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.
PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:
• (Verizon) Mirror AWS hosting infrastructure using VCC platform • (Verizon) Setup Infrastructure, OS and Applications per specification sheet • (Customer) Adjust necessary configurations and settings on VCC platform • (Customer) Upload test data – 10 million users, 100 million contacts • (Customer) Execute smoke, performance and aging test in AWS environment • (Customer) Execute smoke, performance and aging test in VCC environment • (Customer) Compare AWS and VCC results and captured metrics • (Customer) Deliver report to senior management
The high-level diagram below depicts the application infrastructure hosted on the AWS platform.
Figure 1: AWS Application Deployment
Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform in its early beta stage imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release. The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

Resource Limit              VCC        AWS
VPUs per VM                 8          32
RAM per VM                  28 GB      244 GB
Volumes per VM              5          20+
IOPS per Volume (SSD)       3000       4000
Max Volume Size             1 TB       1 TB
Guaranteed IOPS per VM      15K        40K
Throughput per vNIC         500 Mbps   10 Gbps
Table 1: Major Infrastructure Limits
Besides obvious points, like the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU/RAM ratio – processor count to memory size – is quite different as well: 1:4.5 for VCC and 1:7.625 for AWS. This ratio is crucial for certain types of applications, specifically for databases. Despite the aforementioned differences, it was jointly decided with the Customer to move forward with smaller VCC VMs and to consider the sizing ratio while comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and the 4-8x hardware footprint differences. The tables below summarize infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the corresponding platforms are highlighted.

AWS VM Profiles:

VM Role     Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat      2       4      34.2      -      1000
MySQL       1       32     244       10K    10000
Cassandra   8       8      68.4      5K     1000
HA Proxy    4       2      7.5       -      1000
DB Cache    2       4      34.2      -      1000
Table 2: AWS Infrastructure Mapping and Sizing
VCC VM Profiles:

VM Role     Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat      2       4      28        -      500
MySQL       1       4      28        9K     500
Cassandra   12      4      28        5K     500
HA Proxy    4       2      4         -      500
DB Cache    2       4      28        -      500
Table 3: VCC Infrastructure Mapping and Sizing
The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation in place, luckily addressed later – all disks connected to a particular VM had to be provisioned with the exact same IOPS rate. The most common setup used was based on LVM2, with a linear extension for the boot disk volume group and either two or three additional disks aggregated into an LVM stripe set. This setup allowed creating disk volumes of up to 3 TB and 9000 IOPS, getting close enough to the required 10K IOPS for the database VMs.

Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it could have resulted in a number of human errors and omissions. Automating and scripting major parts of the installation and setup process addressed this. The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:
- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages
Keeping all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and on-host deployed versions, providing rudimentary configuration management capabilities. Below is a high-level architecture for the originally implemented test environment.
Figure 2: Initial VCC Application Deployment
The test load was initiated by a JMeter Master (Test Controller and Management GUI) and generated by several JMeter Slaves (Load Generators or Test Agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance. Since F5 LTM instances were not available at build time, the proposed design utilized pfSense appliances as routers, load-balancers or firewalls for the corresponding VLANs. The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NOSQL DB) – employing Couchbase (DB Cache) as a caching layer.
Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. The Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection. Following the initial plan, the hosting environment was handed over to the Customer on schedule for adjusting configurations and uploading test data.
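For reference, an nmon collector is typically started as in the sketch below; the sampling interval and count are illustrative, not a record of the values used in the project.

# -f: write a timestamped .nmon file, -s: seconds between samples, -c: number of samples
nmon -f -s 15 -c 2880    # roughly 12 hours of data at 15-second resolution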
PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms and the Customer shared the test results. During the test the load was ramped up using 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls emulating human behavior, using patterns observed and measured on the production application. The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs.
Figure 3: First Test Results - Comparison Chart
It was obvious that the AWS infrastructure is more powerful, processing more than twice the throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.
Underlying chart data (TPS at each virtual user count):

VU count       200   300   400   500   600   700   800   900
AWS TPS        321   462   539   627   637   645   651   654
Verizon TPS    203   256   269   257   275   249   268   247
Indeed, the application "Transactions per Second" (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream. According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds and more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.
Figure 4: First Test - High CPU Load on DB Server
Figure 5: First Test - High CPU %iowait on DB Server
Furthermore, out of the three stripes composing the data volume, one constantly reported significantly higher device wait times and utilization percentages, effectively causing disk I/O skews.
Figure 6: First Test - Disk I/O Skew on DB Server
Obviously, these test results were not acceptable. To investigate and identify bottlenecks and performance limiting factors good knowledge of the application architecture and its internals was required as well as deep VCC product and storage stack knowledge, since the latter two issues seemed to be platform and infrastructure related. To address this a dedicated cross-‐team taskforce was established.
PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more and it was reconfirmed that there are no limiting factors or shortcomings in the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS or application layer. On the other hand, the Customer confirmed that the AWS deployment used exactly the same configuration and application versions as VCC. The only possible logical conclusion was that the setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration. Further efforts were aligned with the following objectives:
- Improve storage throughput and address I/O skews
- Identify the root cause of low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run performance tests and demonstrate improved throughput and predictable performance levels

Originally the storage volumes were set up using Customer specifications and OS defaults for the other parameters. After research and a number of component performance tests, several interesting discoveries were made, in particular:
- Different Linux distributions (Ubuntu and CentOS) take a different approach to disk partitioning: Ubuntu aligned partitions for 4K block sizes, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file systems yield very different results depending on the number of concurrent threads performing I/O
- Due to the Linux optimizations and multiple caching levels it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack
After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack (a minimal configuration sketch follows the list):
- Use raw block devices instead of partitions for RAID stripes to circumvent any partition block alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM; however, it performs a certain amount of "optimization" assuming spindle-based storage that may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don't let the system guess – specify!)
- Use the EXT4 file system instead of XFS. EXT4 provides journaling for metadata and data instead of metadata only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate different I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) for the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (a cache may in certain situations give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. VCC storage is very low latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue once the cache has been exhausted
- Make sure the network device driver is not competing with block device drivers and the application for CPU time by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible for more optimal storage stack operation
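A minimal sketch of these recommendations is shown below. It assumes three 5K IOPS data disks (xvdc, xvde, xvdf) and a 64K chunk size purely for illustration; the actual device names, chunk size and mount options must be taken from the project specification.

# create a RAID0 stripe from raw block devices (no partitions), stating the chunk size explicitly
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=64 /dev/xvdc /dev/xvde /dev/xvdf
# create EXT4 aligned to the stripe geometry:
# stride = chunk / block size = 64K / 4K = 16; stripe-width = stride x data disks = 48
mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/md0
# mount with access-time updates disabled (remaining mount options per the recommended settings)
mount -o noatime /dev/md0 /data
# switch the member devices to the NOOP elevator
echo noop > /sys/block/xvdc/queue/scheduler
echo noop > /sys/block/xvde/queue/scheduler
echo noop > /sys/block/xvdf/queue/scheduler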
After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS reported ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:
(4 KB * 10000 IOPS) / 1024 = 39.06 MB/s, the maximum possible throughput
(4 KB * 15000 IOPS) / 1024 = 58.59 MB/s, the maximum possible throughput
With a 15K IOPS setup using 3 stripes (5K IOPS each), ~55-56 MB/s throughput was achieved, as shown in the screenshot below:
Figure 7: Optimized Storage Subsystem Throughput
Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range. While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly across the devices implementing RAID stripes:
# distribute block device interrupts between CPU4-CPU7
# inspect the current IRQ assignments and affinity masks first
cat /proc/interrupts
cat /proc/irq/183[3-6]/smp_affinity*
# pin each device IRQ to its own core; the values are hex CPU bitmasks
# (0x80 = CPU7, 0x40 = CPU6, 0x20 = CPU5, 0x10 = CPU4, 0x08 = CPU3)
echo 80 > /proc/irq/1836/smp_affinity
echo 40 > /proc/irq/1835/smp_affinity
echo 20 > /proc/irq/1834/smp_affinity
echo 10 > /proc/irq/1833/smp_affinity
echo 8 > /proc/irq/1838/smp_affinity
Please note that IRQ numbers and assignments may differ on your system; consult the /proc/interrupts table for the specific assignments pertinent to your system. For additional details and theory, please refer to the following online materials:

http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. Percona tools were instrumental for this task, allowing multithreaded transaction replay with delays inserted between transactions as recorded. A plain SQL script import would have been processed by a single thread only, with all requests processed as one stream.

Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different compared to the I/O patterns observed during the test. Transaction logs include only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those "select" requests represented 75% of all requests, it quickly became apparent that such a testing approach was flawed and would not be able to recreate real-life conditions.

We came to a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in DB tables. To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms. The results were speaking for themselves.
Figure 8: AWS i2.8xlarge CPU load - Sysbench Test Completed in 64.42 sec
Figure 9: VCC 4C-28G CPU load - Sysbench Test Completed in 283.51 sec
At this point it was clear that the DB server's performance issues had nothing to do with application logic and were not specific to the SQL workload, but rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer's application know-how and the solution-wide test harness.
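The synthetic OLTP runs were of the kind sketched below (sysbench 0.4-style syntax); the host, credentials, table size and thread count are illustrative placeholders, not the parameters behind the figures above.

# populate the test table, then run the OLTP workload against it
sysbench --test=oltp --mysql-host=db01 --mysql-user=sbtest --mysql-password=sbtest --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-host=db01 --mysql-user=sbtest --mysql-password=sbtest --oltp-table-size=10000000 --num-threads=64 --max-requests=100000 run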
Thorough research and study of the InnoDB engine began… Studying the source code as well as consulting the following online resources was key to a clear understanding of the DB engine internals and its behavior:
- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org
The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.
Figure 10: InnoDB Engine Internals
Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. It is easy to explain the main factors impacting InnoDB engine performance, though optimizing those factors in practice is quite a challenging task.
InnoDB Performance – Theory and Practice

The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks are tolerated in memory at any given point in time. The obvious question is: "why can't we simply use a gigantic InnoDB transaction log?" The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. So, back to the innodb_log_file_size variable. Let's imagine it as some distance on an imaginary axis:
Our current state is checkpoint age, which is the age of the oldest modified non-‐flushed page. Checkpoint age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. Checkpoint age can’t grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.
In fact, InnoDB has two safety nets or protection points: “async” and “sync”. When checkpoint age reaches “async” point, InnoDB tries to flush as many pages as possible, while still allowing other queries, however, throughput drops down to the floor. The “sync” stage is even worse. When we reach “sync” point, InnoDB blocks other queries while trying to flush pages and return checkpoint age to a point before “async”. This is done to prevent checkpoint age from exceeding innodb_log_file_size. These are both abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the “sync” point is at about 7/8 of innodb_log_file_size, and the “async” point is at about 6/8 = 3/4 of innodb_log_file_size.
So, there is one critically important balancing act: on the one hand we want “checkpoint age” as large as possible, as it defines performance and throughput. But, on the other hand, we should never reach the “async” point.
The idea is to define another point T (target), located before "async", in order to have a gap for flexibility, and to try at all cost to keep the checkpoint age from going past T. We assume that if we can keep the checkpoint age in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.
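To keep an eye on where the checkpoint age actually sits relative to the "async"/"sync" points, it can be derived from the LOG section of SHOW ENGINE INNODB STATUS. The one-liner below is a rough sketch; it assumes a MySQL version that prints the log sequence number as a single value and that credentials are picked up from the client configuration.

mysql -e "SHOW ENGINE INNODB STATUS\G" | awk '
  /Log sequence number/ {lsn=$NF}     # current end of the redo log
  /Last checkpoint at/  {ckpt=$NF}    # oldest change not yet flushed
  END {printf "checkpoint age: %.1f MB\n", (lsn-ckpt)/1048576}'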
Now, which factors affect checkpoint age? When we execute DML queries that change data (insert/update/delete), we write to the log, we change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. So the main lever we have to keep the checkpoint age around point "T" is to change the number of pages being flushed per second, or to make this number variable and suited to the specific workload. That way, we can keep the checkpoint age down. If this doesn't help and the checkpoint age keeps growing beyond "T" towards "async", we have a second control mechanism: we can add a delay into insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching "async".

To summarize, the idea for the optimization algorithm is: under load we must keep the checkpoint age around point "T" by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The throttling depends on the position of the checkpoint age – as it gets closer to "async", we need higher levels of throttling.

From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is… In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking also come into play, and improving performance becomes more like solving an equation with a number of interdependent variables.

Obviously, to be able to iterate over various parameter and setting combinations there is a need to execute DB tests in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although many load-testing frameworks are available, with some specifically tailored for testing MySQL DB performance, unfortunately none of them would cover all requirements and provide the needed tools and automation. Eventually, we developed our own fully automated and flexible load-testing framework. This framework was mainly used to stress test and analyze MySQL and InnoDB
behavior; nonetheless, it is open enough to plug in any other tools or to be used for testing different applications. The developed toolkit includes the following components:
- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility
Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS) while eliminating I/O spikes and achieving consistent and predictable system load – in other words, fulfilling the critically important balancing act mentioned above: keeping the checkpoint age as large as possible while trying not to reach the "async" (or, even worse, "sync") point. The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour test.
Queries per second (QPS) – green
Figure 11: Optimized MySQL DB - QPS Graph
After a warm-‐up phase the system consistently delivered about 22K queries per second.
Transactions per second (TPS) – green Response Time (RT) - blue
Figure 12: Optimized MySQL DB - TPS and RT Graph
After ramping up the load to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end to end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).
%util - red await - green avgqu-sz - blue
Figure 13: Optimized MySQL DB - RAID Stripe I/O Metrics
It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The "await" iostat metric stays constantly below 20 ms, which is the average time for I/O requests to be issued to the device and served. Device utilization is below 25% on average, showing that there is still plenty of spare capacity to serve I/O requests.
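These figures come from standard extended device statistics; a sampling command of the following kind (device names illustrative) reports the await, avgqu-sz and %util columns referenced above.

# extended per-device statistics in MB/s, sampled every 10 seconds
iostat -dxm xvdc xvde xvdf 10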
%idle – red %user - green %system - blue %iowait - yellow
Figure 14: Optimized MySQL DB - CPU Metrics
The CPU metrics show that on average 55% of CPU time was idle, 35% was spent in user space (i.e. executing applications), 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.
bytes sent - green bytes received - blue
Figure 15: Optimized MySQL DB - Network Metrics
The network traffic measurement suggests that the network capacity is fully consumed – in other words, the network is saturated, with ~48 MB/s sent and ~2 MB/s received. These ~50 MB/s of cumulative traffic are very close to the practical maximum throughput that can be achieved on a 500 Mbps network interface. In plain English this means that the network is the limiting factor here; with other resources still available, the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer application did not utilize more than 300 concurrent DB connections.

Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls. The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use-case. Therefore, these optimizations cannot be considered universally applicable in VCC environments and must be tailored to a specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.
< … removed … >
Table 4: Optimized MySQL DB - Recommended Settings
Besides the parameter changes listed above, the binary logs (also known as transaction logs) were moved to a separate volume, where the Ext4 file system was set up with the following parameters:
< … removed … >
Further areas for DB improvement:

- Consider using the latest stable Percona XtraDB version, which is based on the MariaDB codebase and provides many improvements, including patches from Google and Facebook:
  o Redesigned locking subsystem with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and eventually in better overall performance
  o Dump and pre-load buffer pool features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o More intelligent flushing algorithm that takes into consideration page change rates, I/O rates, system load and capabilities, thus providing better performance adjusted to the workload out of the box
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate spinning-disk shortcomings
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lessening CPU overhead while retaining data consistency and security
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc (a minimal sketch follows this list):
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc wasn't designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, resulting in turn in quicker replication and backups
  o Monitor the performance schema to find out more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to find more details about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can accommodate badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million-record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.
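For the memory allocator suggestion above, a minimal sketch is to let mysqld_safe preload jemalloc; the library path below is distribution-specific and purely illustrative.

# start mysqld with jemalloc preloaded instead of the GLIBC allocator
mysqld_safe --malloc-lib=/usr/lib64/libjemalloc.so.1 &
# verify which allocator the running mysqld actually mapped
grep -i jemalloc /proc/$(pidof mysqld)/maps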
PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on… Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was understood very well; however, the overall application stack and the associated connection flows were not yet completely understood. The Customer demonstrated willingness to cooperate and assisted by providing instructions for reproducing the JMeter load tests as well as on-site resources for an architecture workshop.

From this point on, the optimization project sped up tremendously. Not only was it possible to iterate reliably and run load tests against the complete application stack; the understanding of the application architecture and access to the Jennifer Application Performance Management (APM) tool made a huge difference in terms of visibility into internal application operation and the major performance metrics.
Figure 16: Jennifer APM Console
Besides providing visual feedback and displaying a number of metrics, Jennifer revealed the next bottleneck – the network.
PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of the AWS VPCs, and pfSense appliances were used to provide network segmentation, routing and load balancing.
< … removed … >
Figure 17: Initial Application Deployment - Network Diagram
pfSense is an open-source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns that VM into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage using the web-based UI, even for users with little knowledge of the underlying FreeBSD system. The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before and after, so nobody expected issues coming from that side…

Watching the Jennifer XView chart closely in real time is fun in itself, like watching a fire. It is also a powerful analysis tool that helps in understanding the behavior of application components.
Figure 18: Jennifer XView - Transaction Response Time Scatter Graph
On the graph above, the distance between the layers is exactly 10000 ms, which points to the fact that one of the application services is timing out at a 10-second interval and repeating connection attempts several times.
Figure 19: Jennifer APM - Transaction Introspection
Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals. Following the old sysadmin adage – "…always blame the network…" – the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and didn't show any reason for concern.
Nonetheless, console access revealed the truth – the load created by a number of short thread spins was not properly reported in the web UI and was hidden by the averaging calculations. A closer look using advanced CPU and system metrics confirmed that the appliance was experiencing unexpectedly high CPU load, adding to latency and dropping network packets. Adding more CPUs to the pfSense appliances resulted in doubling the network traffic passed by them. However, even with the maximum CPU count the network was not yet saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement and were only used to provide routing and load-balancing capability, it was decided to remove them from the application network flow and to access the subnets by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize – it would be wrong to conclude that pfSense does not fit the purpose and is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.
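For reference, the per-CPU and per-thread picture that the web UI averaged away can be inspected on the pfSense console with standard FreeBSD tools; these are generic commands, not a record of the exact diagnostics used.

top -SHP        # per-CPU states plus individual kernel and user threads
vmstat 1        # interrupts and context switches sampled every second
netstat -w 1    # aggregate packets in/out and drops sampled every second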
PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track. With the floodgates wide open we were able to push more than 1000 concurrent users during our tests.

About the same time we started seeing another anomaly – one out of the three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the test at the 3600-second mark, the Java threads belonging to two of the JMeter servers shut down quickly, while the third instance's shutdown took a while, effectively increasing the test window duration and, as a result, negatively impacting the average test metrics. All three JMeter servers were reconfigured to use the same settings: for some reason they had been using slightly different configurations and were logging data to different paths. This didn't resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the issues with one of the existing VMs. Eventually, a fourth JMeter server was deployed. Besides fixing the issue with Java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.
Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become a breaking point itself. In this case, it is recommended to use a scale-out rather than a scale-up approach, keeping the number of Java threads per server below a certain threshold.
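A scale-out run of this kind is typically driven from the master in non-GUI mode; the plan name, heap size and generator hostnames below are illustrative.

# push the test plan to the remote load generators and collect results on the master
JVM_ARGS="-Xms2g -Xmx2g" jmeter -n -t loadtest.jmx -R jmeter01,jmeter02,jmeter03,jmeter04 -l results.jtl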
PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests. After removing pfSense, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by specifying different target application server IP addresses manually in the JMeter configuration files. With a HAProxy load-balancer put in place the TPS figure initially went down to 544, and after some optimizations (disabling connection tracking and netfilter) it increased up to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS at 800 VU, still with pfSense) and a 100% increase over the initial performance test. Overall, the results were looking more than promising.
Figure 20: Iterative Optimization Progress Chart
Despite good progress the following points still required further investigation:
- Disk I/O skew issues still remained
- Cassandra server disk I/O was uneven and quite high
Our enthusiasm rose more and more as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed that past 600 VU performance started to decline, while on VCC we were able to push as high as 1600 VU with the application supporting the load and showing higher throughput numbers (~760-780 TPS), until… The next day something happened that became another turning point in this project. The application became unstable and the throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS scatter landscape in Jennifer was showing a new anomaly…
Figure 21: Jennifer XView - Transaction Response Time Surges
Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain – basically being bored during the performance test – the Cassandra cluster became the next suspect.
PART IX – CASSANDRA
The Tomcat logs pointed to Cassandra as well. There were numerous warning messages about excluding one or another node from the connection pool due to connectivity timeouts. After a closer look at the Cassandra nodes, several points drew our attention:
- There was no consistency in the Cassandra ring load
- The amount of data stored on the Cassandra nodes varied significantly
- Memory usage and I/O profiles differed across the board
As a common trend, after a short normal run period the average system load on several random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, yielding very high CPU %iowait and long queues on the block devices. Everything pointed to the fact that certain Cassandra nodes initiated compaction (an internal data structure optimization) right during the load test, spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same – it was most likely SSTable compaction causing the issue.
Figure 22: VCC Cassandra Cluster CPU Usage During the Test
As seen in the graph above, during the various test runs one or another Cassandra node maxed out its CPU utilization. The same configuration in AWS had been working just fine, with a not perfect but still quite even load and no continuous load spikes.
Figure 23: AWS Cassandra Cluster CPU Usage During the Test
Comparing the VCC and AWS Cassandra deployments led to quite contradictory conclusions:
- VCC has more nodes – 12 vs. 8 in AWS – and that should improve performance, right?
- AWS uses spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should improve performance too…

Like with MySQL, it was clear that the settings taken from AWS – optimal there, or at least "good enough" – are not good, and at times even bad, on the VCC platform. For historical reasons the Customer's application utilizes both SQL and NOSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring using 12 nodes in VCC instead of the 8 nodes in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite – to use an even bigger number of smaller VMs for the Cassandra cluster.

It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems based on slow spinning disks. During the past couple of years SSDs started to appear more and more often in data centers. While not a commodity yet, SSDs became a heavily used component in modern infrastructures, and the Cassandra codebase was adjusted to make internal decisions and algorithms more suitable for use with SSDs, and not only spinning disks. Therefore, deploying the latest stable Cassandra version could have provided additional benefits right away. Unfortunately, the specification required a specific version, and therefore all optimizations were performed against the older version.

Let's have a quick look at Cassandra's architecture and some key definitions.
Figure 24: High-Level Cassandra Architecture
Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes. Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo's architecture, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see the figure above). This is known as a distributed hash table (DHT). The data model and single-node architecture are mainly based on BigTable and its terminology.

Cassandra can be classified as an extensible row store, since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, columns are grouped into more static column families. These are treated like tables in a relational database. Each column family is stored in separate files. In order to allow this level of flexibility – a different schema per row – Cassandra stores metadata with each value. The metadata contains the column name as well as a timestamp for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). SSTables consist of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable; the bloom filter enables checking for the existence of keys in the table. Due to its limited size, the bloom filter is also cached in memory. The data file is ordered for faster scanning and merging. For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be confirmed. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index to the SSTables. If these are used, write operations have to keep them updated. It is worth mentioning that in Cassandra only previously accessed rows are cached, in both caches. As a result, new rows will only be written to the Memtable but not to the caches.

In order to deliver the lowest possible latency and best performance on low-end hardware, data writes in Cassandra use a multi-step process: requests are first written to the commit log, then to a Memtable structure and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, they become fragmented, which impacts read performance. To put it simply, flushing and compaction operations are vitally important for Cassandra. However, if set up incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. Exactly this was happening during the test, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Obviously, Cassandra's configuration had been tuned for spinning disks on AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.

As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact and helping to observe overall cluster behavior. Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:
< … removed … >
Table 5: Optimized Cassandra - Recommended Settings
Similar to the MySQL optimization, the basic idea is to use more frequent I/O, saturating the block device queues less and, as a result, utilizing the storage stack resources more optimally. Besides the recommended option changes, the commit log was moved to a separate volume. Those changes led to predictable and consistent Cassandra performance, evenly and constantly forcing in-memory data to disk, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:
xvda  600 IOPS  – boot and root
xvdb  600 IOPS  – lvm2 root extension
xvdc  4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde  4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf  4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg  5000 IOPS – commit log disk – no partitioning
There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50 MB/s, which is sufficient for normal cluster operation and in line with the storage subsystem throughput configured on the Cassandra nodes. However, those thresholds sometimes need to be changed. In case of cluster rebalancing, maintenance and similar operations, the following handy shortcuts may be used to control the thresholds cluster-wide:
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done
Obviously, after maintenance has completed, those thresholds should be set back to appropriate values for normal production use.
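Whether a node has worked off its backlog before the limits are lowered again can be checked per node, for example (JMX port as above, node name illustrative):

./nodetool -h node01 -p 9199 compactionstats   # pending and active compactions
./nodetool -h node01 -p 9199 netstats          # active streams, e.g. during rebalancing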
PART X – HAPROXY With the DB layer fixed, application performance became stable across tests, although two points were still raising some concerns:
- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost half
- The number of virtual user requests reaching each application server was quite different, at times approaching a 1:2 ratio
Figure 25: Jennifer APM -‐ Concurrent Connections and Per-‐server Arrival Rate
It was time to take a closer look at the software load-balancers based on HAProxy. This application is known to serve 100K+ concurrent connections, so a mere thousand concurrent connections should not get anywhere close to its limits. Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one or another system in an unpredictable manner. The most even request distribution was achieved with the least-connections (leastconn) algorithm. After implementing this change, the load was spread evenly across all systems.
Figure 26: Jennifer APM -‐ Connection Statistics After Optimization
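The change itself is a one-line switch in the backend definition. A minimal sketch, with placeholder backend and server names and addresses:

backend tomcat_nodes
    balance leastconn            # was: balance roundrobin
    server app01 10.0.0.11:8080 check
    server app02 10.0.0.12:8080 check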
Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection-tracking facility used by iptables) about overflowing tables and dropped connections, pointed to the next optimization steps. Initially, it was decided to increase the size of the connection-tracking tables and internal structures and to disable the SYN flood protection mechanism.
< … removed … >
This did show some improvement; eventually, however, it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.
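On the RHEL/CentOS-style systems used here (SysV init, as the service commands later in this chapter suggest), turning the facility off amounts to something like the following; this is illustrative, not a record of the exact steps taken:

# service iptables stop      # stop the firewall and flush its rules
# chkconfig iptables off     # keep it disabled across reboots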
During the subsequent tests, when the generated load was increased further, HAProxy hit another issue, often referred to as "TCP socket exhaustion". A quick reminder: there were two layers of HAProxies deployed. The first layer load-balanced the incoming HTTP requests originating from the application clients across the Java application server (Tomcat) instances, and the second layer passed requests from the Java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and therefore uses its own IP address to establish connections to the server. Most operating systems' TCP stacks have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a "source IP:port => destination IP:port" combination is in use, it cannot be re-used. As a consequence, there cannot be more than 64K open connections from an HAProxy box to a single remote IP:port couple.

On the front layer the HTTP request rate was a few hundred per second, so the limit of 64K simultaneous open connections to the remote service was never approached. On the backend layer there should not have been more than a couple of hundred persistent connections during peak time, since connection pooling was used on the application server. So this was not the problem either.

It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior. Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough. But over a LAN connection between two different machines, the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:
MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==> FIN ==> MySQL Server
MySQL Client <== FIN ACK <== MySQL Server
MySQL Client ==> ACK ==> MySQL Server
This results in the client-side connection (and its source port) remaining unavailable for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
Explanation of the issue by Charlie Schluting1: “There is no way for the person who sent the first FIN to get an ACK back for that last ACK. You might want to reread that now. The person that initially closed the connection enters the TIME_WAIT state; in case the other person didn’t really get the ACK and thinks the connection is still open. Typically, this lasts one to two minutes.” Since each source port stays unavailable for 2 minutes, anything above roughly 533 connection requests per second will lead to TCP source port exhaustion:
64000 (available ports) / 120 (number of seconds in 2 minutes) = 533.333
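Whether a box is approaching this limit is easy to check with standard Linux tooling, for example:

# netstat -tan | awk '/TIME_WAIT/ {c++} END {print c}'   # sockets currently in TIME_WAIT
# sysctl net.ipv4.ip_local_port_range                    # source port range available per IP:port tuple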
This TCP port exhaustion occurs on a direct MySQL client-to-server connection as well, but also through HAProxy, because it forwards the client traffic to the server. And since several clients were talking to the same HAProxy, it happened much faster there.

However, this does not explain the front-side HAProxy issues, where HTTP connections were used rather than the MySQL protocol. The problem there: keep-alive. It was a very useful feature in the past, when servers were huge yet slow and client concurrency was generally much lower. Back then you would think twice before forking another process to serve the next incoming connection on a web server: the process creation overhead was far too expensive. With server hardware becoming ever more powerful, and the Linux kernel and software stack getting more and more optimized, most server implementations today use threads and can accept new connections very quickly and efficiently. In the modern world, specifically with the advent of REST and web services, short-lived stateless connections are much more favorable. As the number of clients and the concurrency grow, it becomes less and less optimal to keep sockets busy anticipating another request from the same client.

Another lesson learned: be very conservative with the keep-alive feature and consider turning it off, or reducing the keep-alive timeout significantly, for certain use-cases. This was addressed in the Tomcat connector configuration as well (see the corresponding chapter). While HAProxy provides a number of options and mechanisms for dealing with connection time-outs and keep-alive HTTP connections, it still operates above the transport layer and may not always be able to help with half-closed TCP connections.
1 taken from http://www.enterprisenetworkingplanet.com/print/netsp/article.php/3595616/Networking-101-TCP-In-More-Depth.htm
So how can TCP source port exhaustion be avoided? First, a “clean” close sequence should be used (spot the difference from the one above):
MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client <== FIN <== MySQL Server
MySQL Client ==> FIN ACK ==> MySQL Server
MySQL Client <== ACK <== MySQL Server
This sequence actually occurs when both MySQL client and server are hosted on the same box and use the loopback interface, which is why it was mentioned earlier that added latency between the client and the server is crucial to reproduce the issue. Until the MySQL developers rewrite the code to follow the sequence above, there won't be any improvement here.

Second, increase the source port range. By default, on a Linux box, there are around 28K source ports available for a single IP:port tuple:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000
This limit can be increased to close to 64K source ports:
< … removed … >
And don't forget to update the /etc/sysctl.conf file so the change persists across reboots. It's a good idea to add this configuration to the busiest network servers.

Third, allow usage of source ports in TIME_WAIT. A few kernel settings can be used to tell the kernel to reuse or recycle connections in the TIME_WAIT state faster:
< … removed … >
The tw_reuse option can be used safely, but be careful with tw_recycle: it may have side effects.

Fourth, use multiple IPs to connect to a single server.
In the HAProxy configuration, the source IP address to be used to establish a connection to a server can be specified on the server line. Additional server lines using different source IP addresses can be added. In the example below, the source IPs 10.0.0.100 and 10.0.0.101 are local IP addresses of the HAProxy box:
[...]
server mysql_A 10.0.0.1:3306 check source 10.0.0.100
server mysql_B 10.0.0.1:3306 check source 10.0.0.101
[...]
This would allow up to 2 x 64K = 128K outgoing TCP connections to be opened. The kernel is responsible for selecting a new TCP source port whenever HAProxy requests one. Despite improving things a bit, source port exhaustion may still be reached down the road, at around 80K connections in TIME_WAIT.

Fifth and last, let HAProxy manage the TCP source ports. Instead of leaving the choice of source port to the kernel, HAProxy can decide which port to use when opening a new TCP connection; its built-in port allocation is more efficient for this purpose than the generic kernel implementation. Let's update the configuration above accordingly:
[...]
server mysql_A 10.0.0.1:3306 check source 10.0.0.100:1025-65000
server mysql_B 10.0.0.1:3306 check source 10.0.0.101:1025-65000
[...]
A test showed 170K+ connections in TIME_WAIT with 4 source IPs while avoiding source port exhaustion. As explained in the OS Optimization chapter, since HAProxy is a single-threaded, non-blocking application, it may also be a good idea to pin the haproxy process to a specific CPU, leaving the other CPUs free to handle netdev and blkdev interrupts.
# service haproxy status
haproxy (pid 29621) is running...
# taskset -c -p 3 29621
pid 29621's current affinity list: 0-3
pid 29621's new affinity list: 3
Although this additional step will not result in a significant performance boost, it will make system operation much smoother and will better utilize available system resources.
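Newer HAProxy releases (1.5 and later) can also express this pinning in their own configuration instead of an external taskset call. A sketch, assuming a single process pinned to the fourth core:

global
    nbproc 1
    cpu-map 1 3    # bind process #1 to CPU core 3 (cores are numbered from 0)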
PART XI – TOMCAT
The last, or rather topmost, application in the stack was the Java application server, Tomcat. Within Tomcat itself there is not much to tune, but a couple of points can make a huge difference.

Tomcat Connectors Configuration
The following shows the resulting connector configuration, with modified or added options highlighted:
< … removed … >
In plain English:
- We don't want to resolve hostnames for incoming requests.
- We do want to get rid of keep-alive by all means, since the application is essentially a web service serving short, stateless requests. Allowing keep-alive would quickly hog all connector threads: they would not be available to accept new requests and would not be returned to the pool until they timed out or the client closed the connection, with the latter rarely happening since the clients try to keep the connection alive in the first place. (An illustrative connector sketch follows below.)
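In connector terms, the two points above map to attributes along these lines. This is an illustrative sketch only; the values are placeholders, not the production settings shown earlier:

<!-- enableLookups="false" skips DNS lookups; maxKeepAliveRequests="1" effectively disables HTTP keep-alive -->
<Connector port="8080" protocol="HTTP/1.1"
           enableLookups="false"
           maxKeepAliveRequests="1"
           connectionTimeout="20000"
           maxThreads="500"
           acceptCount="100" />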
Generally speaking, the maximum-threads limit in the production configuration above is also set a bit too high. It may be a better idea to run a cluster configuration with several instances and a lower thread count per instance (e.g. 2 x 500) in order to lessen the thread management overhead. In practice, any configuration change like this requires thorough testing; due to time constraints a cluster configuration was not tested.

Tomcat, Java and Application Logging
The Java application logging was a real resource hog in terms of I/O, disk space and CPU utilization. Application messages were logged to several pipelines, including the console (stdout and stderr), and each 10-minute test run generated about 10-20 GB worth of log files. Yes, Gigabytes, not Megabytes. It shall be noted that, with all admiration for Jennifer, this toolkit is also very generous when it comes to logging: Jennifer APM itself produced a few GBs of log files after every test run. Obviously, without a proper log rotation facility and configuration in place, those logs were exhausting the underlying disk volumes rather quickly, resulting in unpredictable application server behavior and possible system failure. The logging configuration was adjusted:
- To avoid duplicate logging
- To use the valve logging facility, which performs better
- To decrease the number of log messages by raising the log severity threshold to WARN or ERROR, depending on the use-case.
This resulted in a significant drop in I/O and logging volume, by almost an order of magnitude. Correspondingly, log file sizes shrank by an order of magnitude without sacrificing any information important for operations.

DB Connection Pooling
Finally, since DB requests were now being processed within 500 ms on average, there was no longer a need to keep an overblown connection pool to accommodate DB slowness and delays. Therefore, the connection pool configuration was adjusted as well:
< … removed … >
This resulted in more effective connection pool utilization. The example below shows pool usage averaging about 45-48 connections for 900 virtual users.
Figure 27: Jennifer APM -‐ DB Connection Pool Usage
Obviously, connection pool timeouts and eviction rules have to be set up in concert with the DB server connection time-outs to decrease the number of half-closed connections on either end. Further areas for improving Tomcat performance and throughput:
- Fewer threads mean a smaller memory footprint and less context switching, resulting in better CPU cache utilization. Therefore, it is generally recommended to start with a lower number of threads. If Tomcat's thread pool is being exhausted too quickly, it is worth investigating further to establish the root cause. Is it a problem of individual requests taking too long? Are threads being returned to the pool? If, for example, database connections are not released, threads pile up waiting to obtain a database connection, making it impossible to process additional requests. This might indicate a problem in the application code.
- Generally it is not recommended to configure a single connector with more than 500-750 threads. If there is such a need, it's worth looking at setting up a cluster configuration with several instances.
-‐ It’s recommended to validate the Tomcat deployment and remove unused classes and libraries to reduce the overall memory footprint
- It may be worthwhile to test different database connection pool (DBCP) implementations; an illustrative pool resource definition is sketched below
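For reference, the kind of pool sizing and eviction settings discussed above typically look like this in a Tomcat JNDI resource definition (the resource name, URL and values are placeholders, not the project's actual configuration):

<Resource name="jdbc/appDB" auth="Container" type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://db-vip:3306/appdb"
          maxActive="50" maxIdle="10" minIdle="5" maxWait="10000"
          validationQuery="SELECT 1" testOnBorrow="true"
          timeBetweenEvictionRunsMillis="30000"
          minEvictableIdleTimeMillis="60000" />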
PART XII – JAVA
When dealing with any kind of Java application, the immediate concern that comes to mind is Garbage Collection. While analyzing GC behavior it was established that full garbage collections were happening more often than desired.
Figure 28: JVM Garbage Collection Analysis
The chart above shows that during the test run the application executed quite a few full garbage collections, leading to a sub-optimal application throughput of around 91%.
After minor tuning of the garbage collection algorithm, the throughput increased while full garbage collections were almost completely avoided.
< … removed … >
Figure 29: JVM Garbage Collection Analysis – Optimized Run
That said, a thorough study and JVM tuning was not possible due to the time constraints, and there is still some room for improving Java application throughput while reducing the latency caused by garbage collection. It is recommended to study the application's memory allocation patterns further and to size the JVM memory segments properly.

Besides the usual suspects, there are a couple more minor points to watch out for. On systems where both the IPv4 and IPv6 stacks are supported, the JVM may become confused and try to use IPv6 where IPv4 is expected. This behavior can be disabled either via JVM startup options or by disabling IPv6 addressing at the OS level:
< … removed … >
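For illustration, the two common approaches look roughly as follows (the exact settings applied during the project are not reproduced here; CATALINA_OPTS is only an example of where the JVM flag might be placed):

-Djava.net.preferIPv4Stack=true              # JVM option, e.g. appended to CATALINA_OPTS

# or disable IPv6 addressing system-wide:
# sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.default.disable_ipv6=1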
PART XIII – OS OPTIMIZATION
OS optimization could be the subject of a whole book on generic tuning and improvements, yet each and every use-case and application requires a specific approach. Due to the lack of time and various constraints it was decided to take a holistic approach and address the low-hanging fruit that would have the most impact across the board.

Following the application flows, packets first enter the system via a network interface, so the network device itself and the TCP stack were the first points where generic optimizations could be applied. Next, the application processes the received network data in memory while reading from and writing to persistent storage. The storage subsystem optimizations have already been covered; here we look for possible contention points between these two subsystems. Memory allocation and memory pressure is another area where some generic optimizations are possible. Finally, since the application was running in a multitasking environment, we wanted to make sure that it received a higher priority and all required resources, while non-critical processes were assigned a lower priority and did not compete for resources with the prime application.

When taking any optimization steps it is important to understand that a modern OS is a coherent and very well balanced construct. Changing certain OS parameters may cause an imbalance elsewhere and may lead to performance deterioration in the long run and to negative overall results. It is therefore mandatory to test all optimization steps thoroughly prior to applying them to production systems. With the above disclaimer in place, let's get to the bits and bytes.
PART XIV – NETWORK STACK
Setting up an efficient network can be a daunting task. In contrast to the physical server world, on a virtualized infrastructure we have to consider both the physical NICs available to the hypervisor and the virtual network devices as seen by the VMs. There are many possible scenarios where network throughput can be relevant:
• Hypervisor Xen dom0 throughput: the traffic sent/received directly by dom0.
• Single-VM throughput: the traffic sent/received by a single VM.
• Multi-VM throughput: the traffic sent/received by multiple VMs concurrently. Here we are interested in the aggregate network throughput.
• Single-VCPU VM throughput: the traffic sent/received by VMs using a single VCPU only.
• Single-VCPU single-TCP-thread VM throughput: the traffic sent/received by a single TCP thread in single-VCPU VMs.
• Multi-VCPU VM throughput: the traffic sent/received by multi-VCPU VMs.
• Network throughput for storage: the traffic sent/received for virtualized storage access, which uses different underlying physical NICs.

The figure below applies to PV XEN guests and to HVM guests with PV drivers.
Figure 30: XEN PV Driver and Network Device Architecture
When a process in a VM, e.g. a VM with domID = X, wants to send a network packet, the following occurs:
1. A process in the VM generates a network packet P and sends it to one of the VM's virtual network interfaces (VIFs), e.g. ethY_n for some network Y and some connection n.
2. The driver for that VIF, the netfront driver, then shares the memory page containing the packet P with the backend domain by establishing a new grant entry. The grant reference is part of the request pushed onto the transmit shared ring (Tx Ring).
3. The netfront driver then notifies, via an event channel (not depicted in the diagram), one of the netback threads in dom0 (the one responsible for ethY_n) where in the shared pages the packet P is stored.
4. The netback (in dom0) fetches P, processes it, and forwards it to vifX.Y_n.
5. The packet is then handed to the back-end network stack, where it is treated according to its configuration, just like any other packet arriving on a network device.
When a VM is to receive a packet, the process is almost the reverse of the above. The key difference is that on receive a copy is made: it happens in dom0, and it is a copy from back-end-owned memory into a receive buffer which the guest has granted to the back-end domain. The grant references to these buffers are in the requests on the Rx Ring (not the Tx Ring).

Sounds easy, right? One of the promises of virtualization was to remove complexity from technology; from a performance tuning point of view, the complexity has only increased and shifted under the hypervisor hood. For a complete technical explanation it is recommended to refer to the Xen Wiki, from which the information above was taken. As a short synopsis, in order to achieve the best throughput the following recommendations should be considered:
• Proper PV drivers must be in place; they allow the VM to make better use of the underlying hardware.
• Enabling NIC offloading may help to save some CPU cycles.
• It is recommended to use multi-threaded applications for sending/receiving network traffic. This gives the OS a better chance to distribute the workload among multiple CPUs.
• For some use-cases it may be beneficial to use several load-balanced single-VCPU VMs rather than one huge VM with multiple VCPUs.
• If the network driver heavily utilizes one of the available VCPUs, consider associating the application with one or more less-loaded VCPUs, thus reducing VCPU contention and utilizing resources better.
• Consider using modern Linux kernels, where the underlying architecture has been improved so that a VM's non-first VCPUs can process interrupt requests.
• Check in /proc/interrupts whether your device exposes multiple interrupt queues. If the device supports this feature, make sure that it is enabled.
• If the device supports multiple interrupt queues, distribute their processing either automatically (using the irqbalance daemon) or manually (by setting /proc/irq/<irq-no>/smp_affinity) to all or a selected subset of VCPUs (see the example after this list).
• Enable Jumbo Frames for the whole connection. This should decrease the number of interrupts, and therefore the load on the associated VCPUs, for a given amount of network traffic.
• If a host has spare CPU capacity, give more VCPUs to dom0, increase the number of netback threads, and restart the VMs (to force re-allocation of VIFs to netback threads).
• Experiment with the TCP parameters, e.g. window size and message size, to identify the ideal combination for your workload and scenario.
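As an illustration of the interrupt- and MTU-related items above (the interface name and IRQ number are examples only):

# grep eth0 /proc/interrupts              # does the VIF expose multiple interrupt queues?
# echo 2 > /proc/irq/64/smp_affinity      # steer IRQ 64 to VCPU 1 (bitmask 0x2)
# ip link set dev eth0 mtu 9000           # jumbo frames; the hypervisor side must match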
The VCC platform is a shared environment, therefore changes to the hypervisor and dom0 have to go through the change and release management process; if approved by the QA team, they may be included in the next release. This is to say that ad-hoc changes to the network stack outside of the VMs were not considered during the project, with the exception of Jumbo frames. This feature was already available and supported and was only lacking automation to orchestrate setting the MTU size on both the guest- and hypervisor-side network devices. Therefore, only changes limited to the guest VM network settings have been performed, as outlined below.
< … removed … >
Figure 31: Recommended Network Optimizations
Besides the measures mentioned above, some additional research was performed on improving the TCP stack settings. These are the default settings:
net.core.rmem_max = 229376
net.core.wmem_max = 229376
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_congestion_control = cubic
txqueuelen:1000
generic-segmentation-offload: on
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fin_timeout = 60
A number of tests were conducted using iperf to measure throughput between two systems in the same subnet; the average throughput was as shown below:
[ ID] Interval       Transfer     Bandwidth
[ 24]  0.0- 1.0 sec  57.2 MBytes  480 Mbits/sec
[ 24]  1.0- 2.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  2.0- 3.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  3.0- 4.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  4.0- 5.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  5.0- 6.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  6.0- 7.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  7.0- 8.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  8.0- 9.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  9.0-10.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 10.0-11.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 11.0-12.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 12.0-13.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 13.0-14.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 14.0-15.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 15.0-16.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 16.0-17.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 17.0-18.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 18.0-19.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 19.0-20.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  0.0-20.1 sec  1144 MBytes  478 Mbits/sec
After adjusting TCP stack using the script below:
< … removed … >
An additional ~0.5 MB/s throughput was achieved:
[ 35]  0.0- 1.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  1.0- 2.0 sec  57.5 MBytes  483 Mbits/sec
[ 35]  2.0- 3.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  3.0- 4.0 sec  57.2 MBytes  480 Mbits/sec
[ 35]  4.0- 5.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  5.0- 6.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  6.0- 7.0 sec  57.6 MBytes  483 Mbits/sec
[ 35]  7.0- 8.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  8.0- 9.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  9.0-10.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 10.0-11.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 11.0-12.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 12.0-13.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 13.0-14.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 14.0-15.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 15.0-16.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 16.0-17.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 17.0-18.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 18.0-19.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 19.0-20.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  0.0-20.2 sec  1148 MBytes  482 Mbits/sec
The Bandwidth-Delay Product (BDP) settings above were sized for the 10GE connection that VCC may offer one day, but they should be adjusted to the current real-life scenario. For the currently available 500 Mbit bandwidth, the BDP values have to be re-calculated and set much more conservatively (for instance, assuming a ~1 ms round-trip time, 500 Mbit/s x 0.001 s = 500 Kbit, i.e. roughly 62 KB of data in flight).

AWS Optimizations
Since the beginning of the project there was a suspicion that "AWS can't be using off-the-shelf default settings. They must be tuning their instances to perform best on their infrastructure, and those changes may range from specialized kernel settings and drivers to certain application settings allowing the full potential to be realized". Only at the end of the project did we find out that this suspicion was well grounded; the table below shows some network stack settings and their values at the beginning of the journey:
< … removed … >
Table 6: Network Parameter Comparison
On the positive side, by this time we had already come up with those optimizations ourselves and applied them as well. However, these options are just scratching the surface: the number of settings, and combinations of settings, that might still be tuned is practically countless. You may refer to this slide deck for more details: http://www.slideshare.net/cpwatson/cpn302-yourlinuxamioptimizationandperformance
CONCLUSION – LESSONS LEARNED
Since the very early stages of the project there was a question: are we comparing apples to apples, or apples to oranges? Some differences between the AWS and VCC platforms are as obvious as the number of VMs, CPUs, Gigabytes of RAM and IOPS. These are apparent and easy to count with some napkin math. Looking beyond those numbers, however, it quickly becomes clear that, while the temptation is high to conclude that 8 CPUs will perform better than 4, reality shows there is no quick and simple answer. There are a number of unknowns associated with every resource type, e.g.:
- VCPU vs. VCPU:
  o Are we talking about cores, native threads or hyper-threads?
  o What is the CPU and bus frequency on the physical host?
  o What are the physical CPU generation and family used?
- RAM vs. RAM:
  o Are we talking about on-board RAM, directly addressable by the CPU?
  o Are we talking about a NUMA architecture?
  o What is the bus speed?
  o What bus architecture is used?
  o How many memory access channels?
- IOPS vs. IOPS:
  o Are those IOPS backed by SSDs or rotating disks?
  o What storage controller is used?
  o What storage stack is used?
  o Are those IOPS guaranteed or just provisioned?
  o Etc.
As the saying goes, the devil is in the details. Comparing the resources provided by various platforms can help with rough initial sizing decisions, but it cannot be considered a reliable metric for measuring cloud infrastructure performance. Obviously, there is a need for some artificial metric, let's call it, hmm... *cloud stones*, that could be used to assess various cloud platforms. A simplistic way of measuring would then be to say: the AWS platform is capable of delivering X cloud stones, GAE is able to do Y, and VCC is pushing the limits with Z. But does that make any sense? No, it does not. How would you know how many *cloud stones* your application needs?

Another popular approach is to tie those *cloud stones* to cost and create a chart per cloud provider, showing which one delivers the best bang for the buck. This makes decisions somewhat simpler, since the budget is presumably known and it is relatively easy to see where your investment will result in the best possible delivered performance. Yet possibly delivered performance does not equal realized performance; this is where the crux lies, and this is why less can really become more.
The right question to ask is: how do you realize the maximum performance for your application? Obviously, there are countless guides available, provided by vendors, by both financed and independent consulting entities, and by myriads of bloggers on the net. Will they help? Possibly. It is a win-or-lose game: one optimization can boost performance, another one can diminish it again, and the chance that there is a combination perfectly matching your application and infrastructure is far lower than that of winning a lottery.

So the question is still open: what is the right recipe for realizing the best possible performance? First of all, there is no magic bullet or setting that can be universally applied with consistent and predictable results. Second, even following best practices and vendor prescriptions won't necessarily provide the best possible outcome. Using architectural blueprints can help avoid known mistakes, but will not address unknown ones. The only way is to work iteratively, clearly setting your goals and employing repeatable automated testing. The optimization process follows these steps:
- Execute the test and capture performance metrics
- Correlate and interpret metrics
- Identify the bottleneck and understand the root cause
- Address the bottleneck by implementing well documented changes
- Repeat the process until you have achieved your objectives
It may seem that in certain cases this loop is endless, since improving one metric may conflict with another, and common wisdom says you cannot have your cake and eat it too. In reality, however, with each cycle you will learn a lot about your application: how it interacts with the infrastructure and, the other way around, how the infrastructure tolerates certain application shortcomings. Very soon you will become more and more effective at identifying the next optimization step. To put it simply: optimization is an iterative learning process, not a tweak or a milestone.

Coming back to the proof-of-concept project, this is exactly what has been done: building a repeatable test framework; decomposing the whole application into subsystems, which were tuned in isolation, then integrated again and optimized in combination, with the associated re-testing. And here is the result in a picture that says more than a thousand words.
Figure 32: Last Performance Test Results
Interesting questions related to this might be:
- Would the same approach help to improve application performance on the AWS platform while reducing the infrastructure footprint? Yes, definitely.
- Can the findings outlined above be reused for the AWS platform? Some, maybe; others, unlikely or in most cases not.
And the ultimate question: which platform is the best for your application? This can be answered quite simply: the one that helps you realize all the performance your workload requires and that you are paying for.