1/23/2014 12:15 PM Geunsik Lim Sungkyunkwan University Samsung Electronics
Evolution of Memory Architecture

System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance.

• UMA (Uniform Memory Architecture): all CPUs access a single shared memory at uniform cost.
• NUMA (Non-Uniform Memory Architecture): each node couples a group of cores (C0-C3) with its own local memory, and nodes communicate over an interconnection network.

(Diagram: UMA vs. NUMA topologies - nodes, sockets, cores, and threads - with the operating system and run-time system layered above the hardware.)
Must end users be NUMA-aware?

Unfortunately, yes:
• Users must be aware of PCIe device slot placement.
• Optimal NUMA tuning is not yet performed by the OS.
• Persistent tuning is a non-trivial task.
• Performance challenges are changing faster than the tools.
Motivation

(Figure: ccNUMA architectures - IBM Cell Broadband Engine and SGI Altix 3000 - with 24 Gb/s interconnect links.)

In practice, server administrators have varying levels of OS knowledge. As a result, not every administrator can tune a NUMA server for optimal memory utilization and performance in a real environment.
What is the main goal?

Propose an automatic user-space service daemon that provides the best performance by avoiding unnecessary latency, aimed at newcomers to server administration:
• Bind processes to NUMA nodes automatically in user space.
• Automatically improve NUMA system performance with the proposed system.
• Also support a manual-setting infrastructure, like the existing tools, for veterans.
Related work

• Autonuma: kernel-space, purely OS approach; not an aggressive approach.
• NUMA Balancer: kernel-space, purely OS approach; not an aggressive approach.
• Mel Gorman's MM: kernel-space, purely OS approach; not an aggressive approach.
• Sergey's scheduler: Pros: user-space, aggressive approach. Cons: manual configuration; memory utilization suffers because of the affinity method; it is not a memory scheduler.
• UNAS (this work): Pros: automatic, user-space, easy to manage, aggressive approach. Cons: does not follow up with in-depth memory management; it is not a memory scheduler.
What is UNAS?

UNAS is a user-space scheduler that monitors NUMA topology and usage. It distributes loads for good locality, providing the best performance by avoiding unnecessary latency. The goal of UNAS is to automatically bind processes to NUMA nodes; it is released under the GPL license.

(Figure: initial allocation vs. the new NUMA-aware allocation.)
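The binding step can be sketched with the standard Linux CPU-affinity API. This is a minimal sketch under stated assumptions, not the UNAS source: `bind_to_cpus` pins a task to a CPU set that is assumed to belong to a single NUMA node. A real daemon would first read the node's CPU list from sysfs and would also migrate the task's memory (e.g., via libnuma).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Sketch (not the UNAS source): pin a task to a set of CPUs that
 * are assumed to belong to one NUMA node.  pid 0 means "the calling
 * process"; a daemon would pass the target task's pid instead. */
static int bind_to_cpus(pid_t pid, const int *cpus, int ncpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

On success the kernel immediately restricts the task to the given CPUs; subsequent memory allocations then tend to land on the local node under the default local-allocation policy.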
Design

(Diagram: in user space, the run-time monitor and the user-space scheduler; the Monitor thread collects NUMA-specific data from ProcFS and SysFS in kernel space, the Reporter maintains a per-task NUMA list, and the scheduler acts on the NUMA memory nodes.)
Proposed Scheduler

Algorithm 1. Monitor: run-time monitoring mechanism
1. Create a new thread to receive and handle the run-time monitoring data.
2. Repeat monitoring until the NUMA-aware user-space scheduler stops:
3.   Sleep between samples of NUMA-specific data (from /proc/stat).
4.   Collect the monitoring report.
5. End repeat loop.

Algorithm 2. Reporter: reporting mechanism for the collected NUMA-specific data
Input: run-time monitoring data
1. Repeat until the run-time monitoring mechanism stops:
2.   Receive data from online monitoring and filter it.
3.   Collect NUMA-specific data.
4.   If the system load is unbalanced, the behavior of the processes has changed, or a powerful core is idle:
5.     Compute the run-time speedup factor.
6.     Sort the process NUMA list by multi-core speedup factor.
7.     Compute the contention degradation factor.
8.     Sort the process NUMA list by contention degradation factor.
9.     Send a signal to trigger scheduling.
10.  End if.
11. End repeat loop.
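The Reporter's sorting steps (6 and 8 above) amount to an ordinary sort over the process NUMA list. The struct fields and the descending order below are assumptions for illustration; the slides do not give the exact formulas for the two factors.

```c
#include <stdlib.h>

/* Sketch of the Reporter's sorting steps: each tracked process
 * carries a run-time speedup factor and a contention-degradation
 * factor, and the NUMA list is ordered so that the best migration
 * candidates come first.  Field names and ordering are assumptions. */
struct proc_entry {
    int    pid;
    double speedup;      /* multi-core speedup factor     */
    double degradation;  /* contention degradation factor */
};

/* Descending order by speedup factor. */
static int by_speedup_desc(const void *a, const void *b)
{
    const struct proc_entry *x = a, *y = b;
    return (x->speedup < y->speedup) - (x->speedup > y->speedup);
}

static void sort_numa_list(struct proc_entry *list, size_t n)
{
    qsort(list, n, sizeof *list, by_speedup_desc);
}
```

The second sort, by contention degradation, would use the same pattern with a comparator over the `degradation` field.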
Proposed Scheduler

Algorithm 3. User-space scheduler: automatic NUMA-aware scheduling
Input: NUMA list
1. Compute the number of powerful-core candidates based on the load-balanced memory policy.
2. Retrieve from the NUMA list the processes suitable to be scheduled on powerful cores.
3. Set a static CPU pin from the administrator's manual input.
4. If the retrieved processes != the current processes on the powerful cores:
5.   Migrate the processes.
6. End if.
7. If the current resource-contention degradation is too large:
8.   Scatter the processes with heavy contention.
9.   Calculate the degradation factor in order to minimize resource-contention degradation.
10.  Migrate the processes and their sticky pages.
11. End if.
Flowchart of Proposed Scheduler
Monitoring the characteristics
of NUMA
Setting static CPU pin
manually
Allocate Memory based on
monitoring info.
Reallocate for optimal
allocation dynamically
Per 10
Seconds
Manual Setting by
Administrator
END
START New allocation
Re-allocation
• /proc/<pid>/stat
• /proc/<pid>/numa_maps
• /sys/class/numa_topology
11
Implementation of UNAS

Default configuration:
• Max Nodes: 256
• Max CPUs: 2,048
• CPU Threshold: 30
• CPU Scale Factor: 100
• Memory Threshold: 300 MB

Main loop (simplified): the daemon periodically refreshes the process list from "/proc/%s/stat" and rebinds processes, migrating their memory when needed:

    for ( ; ; ) {
        if (NUMA) {
            update_processes();        /* reads "/proc/%s/stat" */
            interval = manage_loads(); /* calls bind_process_and_migrate_memory() */
            time_interval(10);
        }
    }

Sample monitoring data:

invain@numa-server:/proc/2028$ cat ./stat
2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0
2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423
18446744073709551615 0 0 17 53 0 0 30 0 0
(The 18th field, 20, is the priority; the 39th field, 53, is the CPU the task last ran on, out of CPUs 0-79.)

invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap
7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack
7fffb6601000 default stack anon=37 dirty=37 N1=37
(The N<node>=<pages> tokens show on which nodes the heap and stack pages are allocated.)
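The /proc/<pid>/stat line shown above can be split as sketched below. Because the comm field (in parentheses) may itself contain spaces, the usual trick is to scan back from the last ')' and count fields from there. This helper is an illustration of that parsing, not the UNAS parser.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: return field n of a /proc/<pid>/stat line, counting the
 * state character as field 1 (i.e., n = overall field number - 2,
 * so priority is n = 16 and the last-run CPU is n = 37). */
static int stat_field(const char *line, int n, char *out, size_t outsz)
{
    const char *p = strrchr(line, ')');   /* skip "pid (comm)" */
    if (p == NULL)
        return -1;
    char buf[512];
    snprintf(buf, sizeof buf, "%s", p + 1);
    int i = 0;
    for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
        if (++i == n) {
            snprintf(out, outsz, "%s", tok);
            return 0;
        }
    return -1;
}
```

Applied to the Xorg line on the slide, field 16 after the comm is the priority (20) and field 37 is the last-run CPU (53).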
Evaluation

Test environment (40 cores + 40 threads):
• Server: DELL PowerEdge R910
• CPU: Intel Xeon E7-4850 @ 2.00 GHz (40 cores)
• Memory: 32 GiB
• OS: Linux 3.2
• Platform: Ubuntu 12.04 LTS
• Benchmarks: PARSEC

(Benchmark result charts comparing UNAS against the baseline; figures not reproduced in this text.)
References
Auto NUMA Ver 26 : http://lwn.net/Articles/488709/
Peter Zijlstra's NUMA scheduling patch set : http://lwn.net/Articles/486858/
NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2).
Libnuma : Link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly. The higher level interface provided by the numa(3) functions in the numactl package is recommended.
RHEL 6.3 : Redhat Enterprise Linux ver 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
Yinan Li, et al., "NUMA-Aware Algorithms: the Case of Data Shuffling," CIDR, 2013
Conclusion

• Not every administrator can easily manage a NUMA server for optimal memory utilization and performance in a real environment.
• UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
• UNAS distributes loads for good locality in order to provide the best performance.
• UNAS automatically binds processes to NUMA nodes, released as a "beer license".
Thank you for your attention
Any questions?
BACKUP SLIDES
In Case We Have More Time…
Migrating pages to optimize NUMA locality

NUMASCHED
• Who: Lee Schermerhorn (HP)
• Progress: RFC (at LPC 2010)
• Key factors: lazy/auto-migration
• Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
• Operations: automatic page migration for virtualization on x86_64
• Eval: refer to NUMA Balancer

SchedNUMA
• Who: Peter Zijlstra (Red Hat)
• Progress: PATCH v1 (rewrite of NUMASCHED)
• Key factors: allowing processes to be put into "NUMA groups" that will share the same home node
• Details: 1) putting processes into NUMA groups: int numa_mbind(); 2) binding to the NUMA group identified by ng_id: int numa_tbind()
• Operations: new system calls; http://lwn.net/Articles/486850/
• Eval: 55% faster than mainline (Dan Smith)

AutoNUMA
• Who: Andrea Arcangeli (Red Hat)
• Progress: Alpha 23
• Key factors: scanning / auto-migration
• Details: page-table scanner / knuma_migrated per-NUMA-node queues
• Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
• Eval: 35% faster than mainline (Dan Smith)

Automatic NUMA balancing
• Who: Mel Gorman (SUSE)
• Progress: since Linux 3.8
• Key factors: SCHEDNUMA + AUTONUMA
• Details: migration on fault, with PTE (Migrate On Reference Of pte_numa Node [MORON])
• Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v4r38
• Eval: mmtest utility
• GOAL: keep processes and their memory together on the same NUMA node, with support for automatically migrating pages to optimize NUMA locality.
• Eval 1): https://lkml.org/lkml/2012/3/20/508
• AutoNUMA benchmark ver 0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
• mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
Tools for NUMA Tuning
numactl
cgroups
taskset
lstopo
dmidecode
sysfs
irqbalance
numad
top
numatop
htop
tuna
irqstat
tuned-adm
Removal of existing bottlenecks
• Multi-queue block layer: http://kernel.dk/blk-mq.pdf

Improved tools
• numatop: https://01.org/numatop
• top: https://gitorious.org/procps/procps (top: added NUMA support)
• irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
• Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
• NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
Appendix: /proc/[number]/numa_maps (since Linux 2.6.14)

This file displays information about a process's NUMA memory policy and allocation. Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and on which nodes the pages have been allocated.

numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
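The per-node page counts in a numa_maps line can be pulled out with a small scanner. A sketch, not the UNAS code: it walks the N<node>=<pages> tokens of one line and reports the node holding the most pages of the mapping (node 1 for both the heap and stack lines shown on the implementation slide).

```c
#include <stdio.h>
#include <string.h>

/* Sketch: scan one /proc/<pid>/numa_maps line for N<node>=<pages>
 * tokens and return the node that holds the most pages of the
 * mapping (or -1 if no per-node counts are present). */
static int busiest_node(const char *line, long *pages_out)
{
    int  best_node  = -1;
    long best_pages = -1;
    for (const char *p = strchr(line, 'N'); p; p = strchr(p + 1, 'N')) {
        int  node;
        long pages;
        if (sscanf(p, "N%d=%ld", &node, &pages) == 2 && pages > best_pages) {
            best_pages = pages;
            best_node  = node;
        }
    }
    if (pages_out)
        *pages_out = best_pages;
    return best_node;
}
```

A monitor can use this per mapped range to decide whether a process's pages are concentrated on one node or scattered across several.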
Appendix: /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
• http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
• http://man7.org/linux/man-pages/man7/numa.7.html
Appendix: “numactl” Samples

numactl --physcpubind=+0-4,8-12 myapplic arguments
  Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
numactl --interleave=all bigdatabase arguments
  Run big database with its memory interleaved on all CPUs.
numactl --cpubind=0 --membind=0,1 process
  Run process on node 0 with memory allocated on nodes 0 and 1.
numactl --cpubind=0 --membind=0,1 -- process -l
  Run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
numactl --preferred=1 numactl --show
  Set preferred node 1 and show the resulting state.
numactl --interleave=all --shmkeyfile /tmp/shmkey
  Interleave all of the SysV shared memory region specified by /tmp/shmkey over all nodes.
numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
numactl --localalloc /dev/shm/file
  Reset the policy for the shared memory file file to the default localalloc policy.