Eliminating OS-caused Large JVM Pauses for Latency-sensitive Java-based Cloud Platforms

Zhenyun Zhuang, Cuong Tran, Haricharan Ramachandra, Badri Sridharan
{zzhuang, ctran, hramachandra, bsridharan}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA 94043, United States

Abstract: For customer-facing applications deployed on PaaS (Platform as a Service), such as online gaming and online chat, low latency is not merely a preferred feature but a must-have. Given the popularity and power of the Java platform, a significant portion of today's PaaS platforms run Java. The JVM (Java Virtual Machine) manages a heap space that holds application objects. The heap is frequently garbage collected (GC-ed), and applications can occasionally be stopped for a long time during some GC and JVM activities.

In this work, we investigate the JVM pause problem. We found that some large JVM STW (Stop-The-World) pauses cannot be explained by application-level activities or by JVM activities during GC; instead, they are caused by OS mechanisms. We successfully reproduced these problems and identified their root causes. The findings can be used to enhance JVM implementations. We also propose a set of solutions to mitigate and eliminate these large STW pauses. We share our knowledge and experience in this paper.

Keywords: PaaS; Java; JVM; Performance; Cloud platform

I. INTRODUCTION

PaaS (Platform as a Service) cloud platforms [1], in which customer applications are deployed to platform servers that reside in the "cloud", promise a cost-effective and administration-efficient alternative to traditional application deployment. Many PaaS-deployed applications are customer-facing (e.g., online gaming and online chat), so low latency is not merely a preferred feature but a must-have for these applications.
Various studies have suggested that 200 ms is the maximum latency an online user will tolerate before leaving. For this reason, keeping latency below 200 ms (or even lower) should be part of the defined SLA (Service Level Agreement) for applications serving online users.

Given the popularity and power of the Java platform, a significant portion of today's PaaS platforms run Java. One example is Oracle Java Cloud [2], which hosts cloud-deployed Java applications on WebLogic Server [3]. Despite tremendous efforts at various layers (e.g., the application layer and the JVM layer) to improve the performance of Java applications, our production experience shows that Java applications can occasionally experience large STW (Stop-The-World) JVM pauses that cannot be explained by the commonly known causes at the application layer.

Java-based applications run in a JVM (Java Virtual Machine), which manages a heap space to hold application objects. The heap is frequently garbage collected (GC-ed), and the application can be stopped during GC and other JVM activities (e.g., Young or Full GC), which introduces STW pauses. Depending on the JVM options supplied when the JVM instance is started, various types of GC and JVM activities are logged to GC log files.

Although GC-induced STW pauses spent scanning, marking, and compacting heap objects are well known and have received much attention, we found that some large STW pauses are caused by OS (Operating System) mechanisms. In our production environments, we have seen OS-caused large STW pauses (>11 seconds) in our mission-critical Java applications. Such pauses cannot be explained by application-level activities or by the garbage collection work performed during GC.

For latency-sensitive and mission-critical Java applications, STW pauses that exceed the SLA are intolerable. Hence we invested effort in investigating the problem.
We successfully reproduced the problem in lab environments and identified its root cause. The large STW pauses are caused by blocked write() calls issued for GC logging. Although these write() calls are issued in buffered mode (i.e., non-blocking IO), they can still be blocked by OS-internal mechanisms related to "writeback" [4] IO activities. Specifically, when a buffered write() writes to a file, it first writes to memory pages in the OS cache. These pages can be locked by the OS cache-flushing mechanism, "writeback", and the lock can be held for a substantially long time when IO traffic is heavy. Furthermore, for typical production applications, application-level logging (e.g., access logs) and log rotation also prove to be sources of background IO traffic.

We propose solutions to mitigate the large STW pauses. These solutions span different layers: enhancing the JVM, reducing background IO traffic, improving application IO, and separating GC logging from other IO. Depending on the scenario, they can be applied separately or in tandem.

In this work, we share our findings. In the remainder of the paper, Section II provides the necessary technical background, and Section III presents the production issue and our investigation. Based on the findings, we propose solutions in Section IV. We evaluate performance and present the results in Section V. Section VI discusses related work, and Section VII concludes.
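The blocked-write behavior described in this introduction (a buffered write() stalling while the OS holds the target cache pages) can be observed from Java with a small timing probe. The following is a minimal sketch under stated assumptions, not the workload or tooling used in this work: the class name BufferedWriteProbe, the synthetic log line, and the 200 ms threshold (borrowed from the latency figure quoted above) are all hypothetical choices made for illustration. It appends short lines to a file without requesting synchronous IO and records how long each append takes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical probe (an illustration, not part of the paper's tooling):
// times buffered appends to a log-like file and counts the slow ones.
class BufferedWriteProbe {

    // Synthetic payload standing in for one GC log entry (assumption).
    static final byte[] LINE =
            "synthetic line standing in for a GC log entry\n".getBytes(StandardCharsets.UTF_8);

    /** Summary of one probe run. */
    static final class Result {
        final int writes;      // number of appends issued
        final int slowWrites;  // appends that took longer than the threshold
        final long maxNanos;   // longest single append, in nanoseconds

        Result(int writes, int slowWrites, long maxNanos) {
            this.writes = writes;
            this.slowWrites = slowWrites;
            this.maxNanos = maxNanos;
        }
    }

    /** Appends LINE to the file `writes` times and times each append. */
    static Result probe(Path file, int writes, long thresholdNanos) throws IOException {
        int slow = 0;
        long max = 0;
        // No SYNC/DSYNC option is passed, so each write goes to the OS cache (buffered IO).
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < writes; i++) {
                ByteBuffer buf = ByteBuffer.wrap(LINE);
                long start = System.nanoTime();
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                long elapsed = System.nanoTime() - start;
                if (elapsed > thresholdNanos) {
                    slow++;
                }
                if (elapsed > max) {
                    max = elapsed;
                }
            }
        }
        return new Result(writes, slow, max);
    }

    public static void main(String[] args) throws IOException {
        // Target file: first argument if given, otherwise a fresh temporary file.
        Path file = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempFile("write-probe", ".log");
        long thresholdNanos = 200_000_000L; // 200 ms, chosen for illustration only
        Result r = probe(file, 10_000, thresholdNanos);
        System.out.printf("writes=%d slow=%d max=%.3f ms%n",
                r.writes, r.slowWrites, r.maxNanos / 1_000_000.0);
    }
}
```

On an idle machine each append would normally complete quickly; the interesting case is running the probe against a file on a disk that another process is loading with heavy IO and comparing the reported maximum. The probe measures only the write latency seen by the calling thread, so it says nothing by itself about which OS mechanism is responsible for a slow write.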