• Lock contention (particularly spinning lock contention) is the primary, and probably worst, cause of cache line contention
• Cache line contention does have a “cost” associated with NUMA systems, but it is not the same “cost” that you experience with local vs. remote memory latency in NUMA systems
• However, it’s not only about lock contention
− Cache line contention can also come from sharing cache lines due to poor data structure layout – two fields in a data structure that are accessed by completely different processes/threads, but end up in the same cache line
− Worst case: an unrelated and frequently accessed field occupies the same cache line as a heavily contended lock
− Other atomic operations, such as atomic-add, can also generate cache line contention
− Additionally, the processor’s cache prefetch mechanism may also cause false cache line contention
• Test program to show the cost of cache line contention in large NUMA systems:
• Bind threads (1 per core) to specified cores. Memory is allocated from a specific node.
• Once the threads are synchronized, perform a tight loop doing spin_lock/spin_unlock 1,000,000 times. This generates an extreme amount of cache line contention. The spinlock implementation was taken from a Linux 3.0 based kernel.
• Based on the number of threads and the loop iteration count we can calculate the average number of “operations per second per CPU” when <N> CPUs are involved in the cache line contention.
• This is a micro-benchmark, not a real-world test, but it does show the effects of cache line contention so that real code can be written with cache line contention in mind.
1) There is a huge drop in performance when going from 15 cores on 1 socket to 30 cores on 2 sockets
2) There is a smaller drop in performance when the lock’s memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)
• Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU.
• However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.
• Going from 40- or 80-cores to 240-cores (16-sockets) is a major jump.
• Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel.
• As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system – this is a major concern when dealing with 8- and 16-socket platforms.
• This has led us to pursue minimizing cache line contention within Linux kernel locking primitives.
We use the ORC tool to monitor the coherency controller results
(ORC is a platform dependent tool from HP that reads performance counters in the XNC node controllers)
Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
Socket  Agent  10 users  40 users  400 users
0 0 17,341 36,782 399,670,585
0 8 36,905 45,116 294,481,463
1 0 0 0 49,639
1 8 0 0 25,720
2 0 0 0 1,889
2 8 0 0 1,914
3 0 0 0 3,020
3 8 0 0 3,025
4 1 45 122 1,237,589
4 9 0 110 1,224,815
5 1 0 0 26,922
5 9 0 0 26,914
6 1 0 0 2,753
6 9 0 0 2,854
7 1 0 0 6,971
7 9 0 0 6,897
PRETRY indicates the associated read needs to be re-issued.
We can see that when users increase, PRETRY on socket 0 increases rapidly.
There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock.
PRETRY number on socket 0 with 400 users = 400M + 294M = 694M
• The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)
• The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.
Coherency controller results of the kernel with the MCS/Queued spinlock
Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
Socket  Agent  10 users  40 users  400 users
0 0 18,216 24,560 83,720,570
0 8 37,307 42,307 43,151,386
1 0 0 0 0
1 8 0 0 0
2 0 0 0 0
2 8 0 0 0
3 0 0 0 0
3 8 0 0 0
4 1 52 222 16,786
4 9 28 219 10,068
5 1 0 0 0
5 9 0 0 0
6 1 0 0 0
6 9 0 0 0
7 1 0 0 0
7 9 0 0 0
We can see that as users increase, PRETRY on socket 0 also increases – but it is significantly lower than in the kernel without the MCS/Queued lock.
The PRETRY number for socket 0 with 400 users = 84M + 43M = 127M.
This value is about 1/5 of the original kernel (694M).
This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock.
• The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm.
• The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention.
• However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic.
• System time spent spinning on the lock dropped from 54% to 2%.
• Lock algorithms can play a huge factor in the performance of large-scale systems
• The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system
• This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements.
• One of the problems with significant lock contention on blocking locks (such as a mutex) is that as more processes block on the mutex there is less work left to run; this causes the idle balancer to pull processes from another CPU’s run queue, which in turn causes even further cache issues.
• Ensure that we don’t attempt an idle balance operation when it takes longer to do the balancing than the time the cpu would be idle
• We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU
• Max-time-to-balance is decayed at a rate of about 1% per second
• Improve the accuracy of the average CPU idle duration.
• Previously the average CPU idle duration was over estimated resulting in too much idle balancing
[Chart: Java Operations with 16-sockets / 240-cores / 480-threads – 93,609 vs. 18,600]
• A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications.
• Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4
• The primary patch introduces per filesystem I/O completion workqueues (as opposed to global workqueues)
• Allows concurrency on the workqueues - blocking on one inode does not block others on a different inode.
• These patches were back-ported to SLES 11sp3 (and by default now part of 11sp4 and 12)
• Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s
• Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s
• Also improves 16k over-write performance as well as 4k initial-write and over-write performance.
− Workloads with high amounts of mutex contention would spend significant time spinning on the mutex’s internal waiter lock which then delays the mutex from getting unlocked.
− Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters.
− Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.
• Mutex slowpath optimizations
− When a lock can’t be acquired and a thread enters the mutex slowpath, it puts itself on the wait list and tries one last time to acquire the mutex.
− Changed the order so that the acquisition of the mutex is attempted first
− If the mutex is acquired, we do not have to remove the thread from the waiter list
[Chart: Java Operations with 16-sockets / 240-cores / 480-threads – 156,912 vs. 109,933]
• Reference counts are normally used to track the lifecycle of data structures.
− A reference count of zero means the structure is unused and is free to be released
− A positive reference count indicates how many tasks are actively referencing the structure
− When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended.
• The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock protected reference count.
− Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be updated atomically while no one is holding the lock.
• The VFS layer makes heavy use of reference counts for dentry operations.
− Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock contention on the dentry reference count update.
− The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock.
[Chart: % time spinning on dentry lock, AIM-7 short workload – 0.01% vs. 83.74%]
• Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.
• Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance optimized version.
− More efficient bit-map management
• The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload
[Chart: changes in system time for the mls_level_isvalid() code path – reduced from 8.95% to 0.12%]
• The kernel originally serialized hugetlb page faults, handling a single fault at a time.
− Workloads with large working sets backed by hugepages (e.g., databases or KVM guests) can especially suffer from painful startup times due to this.
− Protection from spurious OOM errors under conditions of low availability of free hugepages.
− This problem is specific to hugepages because it is normal to want to use every single hugepage in the system - with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.
• Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized.
− The hash key is selected depending on the mapping type.
− Because the size of the table is static, this can in theory still produce contention when enough hugepages are reserved, but in practice this does not occur.
[Chart: startup time (seconds) of a 10-Gb Oracle DB (Data Mining) – 25.7 vs. 37.5]
• Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance.
• Further investigation identified a set of patches already developed for the upstream Linux kernel:
• Don’t take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it’s not needed)
• Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals
• RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates
[Chart: Java Operations with 16-sockets / 240-cores / 480-threads – 165,119 vs. 87,903]
• Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)
• The User Application ABI remains the same – all applications that run on the “default” SLES 11sp3 kernel have full binary and source compatibility with the “bigsmp” SLES 11sp3 kernel.
• There was a small possibility that this kABI breakage would impact kernel drivers and modules
• Rather than risk compatibility issues at customer sites SUSE created the “bigsmp” flavor of the SLES 11sp3 kernel which contains these additional performance enhancements.
• The bigsmp flavor of SLES 11sp3 has its own kABI
• Requires a recompile of kernel drivers and modules
• SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.
• All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA “default” kernels. Bigsmp will be an on-going flavor for SLES 11sp3 for all platforms.
• The MCS lock is a new locking primitive inside Linux
• Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.
• Maintains a list of spinning waiters.
• When a lock is released the unlocker changes the local variable of the next spinner.
• This change causes the spinner to stop spinning and acquire the lock.
• Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when simple CAS (Compare-and-Swap) attempts fail.
• Fair, passing the lock to each locker in the order that the locker arrived.
• Specialized cancelable MCS locking was applied internally to kernel mutexes
• The cancelable MCS lock is a specially tailored variant: when a spinner needs to reschedule, it must be able to abort the spin in order to block.
[Chart: Java Operations with 16-sockets / 240-cores / 480-threads – 250,981 vs. 137,268]
• A process’s address space is divided among VMAs (virtual memory areas) – each storing a range of addresses that share similar properties, such as permissions.
− A common operation when dealing with memory is locating (find_vma()) a VMA that contains a range of addresses.
• Traditionally the Linux kernel will cache the last used VMA.
− Avoids expensive tree lookups (scales poorly in multi-thread programs).
− This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).
• Replace the cache by a small, per-thread, hash table.
− O(1) lookups/updates, cheap to maintain and small overhead.
− Improves poor locality hit-rates to ~99.9%.
− Improves Oracle 11g Data Mining (4k pages) hit-rates from 70% to 91%.
• #1 16-socket (16s) results on both max-jOPS and critical-jOPS
• #1 8-socket (8s) results on max-jOPS
• 16s max-jOPS results 2.1X greater than Fujitsu 16s results
• 8s max-jOPS results are 2.2X greater than Sugon 8s results
• 8s max-jOPS results 1.1X greater than Fujitsu 16s results
• HP CS900 demonstrates excellent scaling from 8s to 16s
[Chart: SPECjbb2013 max-jOPS results for Sugon I980G10 8-socket (Intel Xeon E7-8890 v2), Fujitsu SPARC M10-4S 16-socket (SPARC64 X), Fujitsu SPARC M10-4S 16-socket (SPARC64 X+), HP ConvergedSystem 900 for SAP HANA 8s/6TB 8-socket (Intel Xeon E7-2890 v2), and HP ConvergedSystem 900 for SAP HANA 16s/12TB 16-socket (Intel Xeon E7-2890 v2) – HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 owns the top two SPECjbb2013 max-jOPS records]
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.
[Chart: SPECjbb2013 max-jOPS results for HP ConvergedSystem 900 for SAP HANA (8s/6TB) 8-socket and (16s/12TB) 16-socket, both Intel Xeon E7-2890 v2 – HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 holds the #1 16-socket SPECjbb2013 max-jOPS record]
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.
• The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system.
• Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:
TBD – provide link for new brief Nov 2014
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en June 2014
• Official benchmark results for HP ConvergedSystem900 for SAP HANA on spec.org:
TBD – provide link for new result (16s/240c/12TB) Nov 2014
TBD – provide link for new result (8s/120c/6TB) Nov 2014
http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB) June 2014
http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB) June 2014
• Traditional UNIX system-level benchmark (written in C).
• Multiple forks, each of which concurrently executes a common, randomly-ordered set of subtests called jobs.
• Each of the over fifty kinds of jobs exercises a particular facet of system functionality
− Disk I/O operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.
• AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and random mixed reads and writes.
• An AIM7 run consists of a series of subruns with the number of tasks, N, being increased after the end of each subrun.
• Each subrun continues until each task completes the common set of jobs. The performance metric, "Jobs completed per minute", is reported for each subrun.
• The result of the entire AIM7 run is a table showing the performance metric versus the number of tasks, N.
• Reference: “Filesystem Performance and Scalability in Linux 2.4.17”, 2002.
• To measure some of the changes done by the futex hashtable patchset, a set of futex microbenchmarks was added to perf-bench:
− perf bench futex [<operation> <all>]
• Measures latency of different operations:
− Futex hash