1/23/2014 12:15 PM Geunsik Lim Sungkyunkwan University Samsung Electronics
Evolution of Memory Architecture

System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance.

• UMA (Uniform Memory Architecture): all CPUs access a single shared memory at uniform cost.
• NUMA (Non-Uniform Memory Architecture): each node couples a group of cores (C0-C3) with its own local memory, and nodes communicate over an interconnection network.

(Diagram: UMA vs. NUMA topologies - nodes, sockets, cores, and threads - with the operating system and run-time system layered above the hardware.)
Must end users be NUMA-aware?

Unfortunately, yes:
• Users must be aware of PCIe device slot placement.
• Optimal NUMA tuning is not yet performed by the OS.
• Persistent tuning is a non-trivial task.
• Performance challenges are changing faster than the tools.
Motivation

(Figure: ccNUMA architectures - IBM Cell Broadband Engine and SGI Altix 3000 - with 24 Gb/s interconnect links.)

In practice, server administrators have varying levels of OS knowledge. As a result, not every administrator can tune a NUMA server for optimal memory utilization and performance in a real environment.
What is the main goal?

Propose an automatic user-space service daemon that provides the best performance by avoiding unnecessary latency, aimed at newcomers to server administration:
• Bind processes to NUMA nodes automatically in user space.
• Automatically improve NUMA system performance with the proposed system.
• Also support a manual-setting infrastructure, like the existing tools, for veterans.
Related work

• Autonuma: kernel-space, purely OS approach; not an aggressive approach.
• NUMA Balancer: kernel-space, purely OS approach; not an aggressive approach.
• Mel Gorman's MM: kernel-space, purely OS approach; not an aggressive approach.
• Sergey's scheduler: Pros: user-space, aggressive approach. Cons: manual configuration; memory utilization suffers because of the affinity method; it is not a memory scheduler.
• UNAS (this work): Pros: automatic, user-space, easy to manage, aggressive approach. Cons: does not follow up with in-depth memory management; it is not a memory scheduler.
What is UNAS?

UNAS is a user-space scheduler that monitors NUMA topology and usage. It distributes loads for good locality, providing the best performance by avoiding unnecessary latency. The goal of UNAS is to automatically bind processes to NUMA nodes; it is released under the GPL license.

(Figure: initial allocation vs. the new NUMA-aware allocation.)
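The binding step can be sketched with the standard Linux CPU-affinity API. This is a minimal sketch under stated assumptions, not the UNAS source: `bind_to_cpus` pins a task to a CPU set that is assumed to belong to a single NUMA node. A real daemon would first read the node's CPU list from sysfs and would also migrate the task's memory (e.g., via libnuma).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Sketch (not the UNAS source): pin a task to a set of CPUs that
 * are assumed to belong to one NUMA node.  pid 0 means "the calling
 * process"; a daemon would pass the target task's pid instead. */
static int bind_to_cpus(pid_t pid, const int *cpus, int ncpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

On success the kernel immediately restricts the task to the given CPUs; subsequent memory allocations then tend to land on the local node under the default local-allocation policy.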
Design

(Diagram: in user space, the run-time monitor and the user-space scheduler; the Monitor thread collects NUMA-specific data from ProcFS and SysFS in kernel space, the Reporter maintains a per-task NUMA list, and the scheduler acts on the NUMA memory nodes.)
Proposed Scheduler

Algorithm 1. Monitor: run-time monitoring mechanism
1. Create a new thread to receive and handle the run-time monitoring data.
2. Repeat monitoring until the NUMA-aware user-space scheduler stops:
3.   Sleep between samples of NUMA-specific data (from /proc/stat).
4.   Collect the monitoring report.
5. End repeat loop.

Algorithm 2. Reporter: reporting mechanism for the collected NUMA-specific data
Input: run-time monitoring data
1. Repeat until the run-time monitoring mechanism stops:
2.   Receive data from online monitoring and filter it.
3.   Collect NUMA-specific data.
4.   If the system load is unbalanced, the behavior of the processes has changed, or a powerful core is idle:
5.     Compute the run-time speedup factor.
6.     Sort the process NUMA list by multi-core speedup factor.
7.     Compute the contention degradation factor.
8.     Sort the process NUMA list by contention degradation factor.
9.     Send a signal to trigger scheduling.
10.  End if.
11. End repeat loop.
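The Reporter's sorting steps (6 and 8 above) amount to an ordinary sort over the process NUMA list. The struct fields and the descending order below are assumptions for illustration; the slides do not give the exact formulas for the two factors.

```c
#include <stdlib.h>

/* Sketch of the Reporter's sorting steps: each tracked process
 * carries a run-time speedup factor and a contention-degradation
 * factor, and the NUMA list is ordered so that the best migration
 * candidates come first.  Field names and ordering are assumptions. */
struct proc_entry {
    int    pid;
    double speedup;      /* multi-core speedup factor     */
    double degradation;  /* contention degradation factor */
};

/* Descending order by speedup factor. */
static int by_speedup_desc(const void *a, const void *b)
{
    const struct proc_entry *x = a, *y = b;
    return (x->speedup < y->speedup) - (x->speedup > y->speedup);
}

static void sort_numa_list(struct proc_entry *list, size_t n)
{
    qsort(list, n, sizeof *list, by_speedup_desc);
}
```

The second sort, by contention degradation, would use the same pattern with a comparator over the `degradation` field.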
Proposed Scheduler

Algorithm 3. User-space scheduler: automatic NUMA-aware scheduling
Input: NUMA list
1. Compute the number of powerful-core candidates based on the load-balanced memory policy.
2. Retrieve from the NUMA list the processes suitable to be scheduled on powerful cores.
3. Set a static CPU pin from the administrator's manual input.
4. If the retrieved processes != the current processes on the powerful cores:
5.   Migrate the processes.
6. End if.
7. If the current resource-contention degradation is too large:
8.   Scatter the processes with heavy contention.
9.   Calculate the degradation factor in order to minimize resource-contention degradation.
10.  Migrate the processes and their sticky pages.
11. End if.
Flowchart of Proposed Scheduler
Monitoring the characteristics
of NUMA
Setting static CPU pin
manually
Allocate Memory based on
monitoring info.
Reallocate for optimal
allocation dynamically
Per 10
Seconds
Manual Setting by
Administrator
END
START New allocation
Re-allocation
• /proc/<pid>/stat
• /proc/<pid>/numa_maps
• /sys/class/numa_topology
11
Implementation of UNAS

Default configuration:
• Max Nodes: 256
• Max CPUs: 2,048
• CPU Threshold: 30
• CPU Scale Factor: 100
• Memory Threshold: 300 MB

Main loop (simplified): the daemon periodically refreshes the process list from "/proc/%s/stat" and rebinds processes, migrating their memory when needed:

    for ( ; ; ) {
        if (NUMA) {
            update_processes();        /* reads "/proc/%s/stat" */
            interval = manage_loads(); /* calls bind_process_and_migrate_memory() */
            time_interval(10);
        }
    }

Sample monitoring data:

invain@numa-server:/proc/2028$ cat ./stat
2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0
2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423
18446744073709551615 0 0 17 53 0 0 30 0 0
(The 18th field, 20, is the priority; the 39th field, 53, is the CPU the task last ran on, out of CPUs 0-79.)

invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap
7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack
7fffb6601000 default stack anon=37 dirty=37 N1=37
(The N<node>=<pages> tokens show on which nodes the heap and stack pages are allocated.)
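The /proc/<pid>/stat line shown above can be split as sketched below. Because the comm field (in parentheses) may itself contain spaces, the usual trick is to scan back from the last ')' and count fields from there. This helper is an illustration of that parsing, not the UNAS parser.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: return field n of a /proc/<pid>/stat line, counting the
 * state character as field 1 (i.e., n = overall field number - 2,
 * so priority is n = 16 and the last-run CPU is n = 37). */
static int stat_field(const char *line, int n, char *out, size_t outsz)
{
    const char *p = strrchr(line, ')');   /* skip "pid (comm)" */
    if (p == NULL)
        return -1;
    char buf[512];
    snprintf(buf, sizeof buf, "%s", p + 1);
    int i = 0;
    for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
        if (++i == n) {
            snprintf(out, outsz, "%s", tok);
            return 0;
        }
    return -1;
}
```

Applied to the Xorg line on the slide, field 16 after the comm is the priority (20) and field 37 is the last-run CPU (53).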
Evaluation

Test environment (40 cores + 40 threads):
• Server: DELL PowerEdge R910
• CPU: Intel Xeon E7-4850 @ 2.00 GHz (40 cores)
• Memory: 32 GiB
• OS: Linux 3.2
• Platform: Ubuntu 12.04 LTS
• Benchmarks: PARSEC

(Benchmark result charts comparing UNAS against the baseline; figures not reproduced in this text.)
References
Auto NUMA Ver 26 : http://lwn.net/Articles/488709/
Peter Zijlstra's NUMA scheduling patch set : http://lwn.net/Articles/486858/
NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2).
Libnuma : Link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly. The higher level interface provided by the numa(3) functions in the numactl package is recommended.
RHEL 6.3 : Redhat Enterprise Linux ver 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
Yinan Li, et al., "NUMA-Aware Algorithms: the Case of Data Shuffling," CIDR, 2013
Conclusion

• Not every administrator can easily manage a NUMA server for optimal memory utilization and performance in a real environment.
• UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
• UNAS distributes loads for good locality in order to provide the best performance.
• UNAS automatically binds processes to NUMA nodes, released as a "beer license".
Thank you for your attention
Any questions?
BACKUP SLIDES
In Case We Have More Time…
Migrating pages to optimize NUMA locality

NUMASCHED
• Who: Lee Schermerhorn (HP)
• Progress: RFC (at LPC 2010)
• Key factors: lazy/auto-migration
• Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
• Operations: automatic page migration for virtualization on x86_64
• Eval: refer to NUMA Balancer

SchedNUMA
• Who: Peter Zijlstra (Red Hat)
• Progress: PATCH v1 (rewrite of NUMASCHED)
• Key factors: allowing processes to be put into "NUMA groups" that will share the same home node
• Details: 1) putting processes into NUMA groups: int numa_mbind(); 2) binding to the NUMA group identified by ng_id: int numa_tbind()
• Operations: new system calls; http://lwn.net/Articles/486850/
• Eval: 55% faster than mainline (Dan Smith)

AutoNUMA
• Who: Andrea Arcangeli (Red Hat)
• Progress: Alpha 23
• Key factors: scanning / auto-migration
• Details: page-table scanner / knuma_migrated per-NUMA-node queues
• Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
• Eval: 35% faster than mainline (Dan Smith)

Automatic NUMA balancing
• Who: Mel Gorman (SUSE)
• Progress: since Linux 3.8
• Key factors: SCHEDNUMA + AUTONUMA
• Details: migration on fault, with PTE (Migrate On Reference Of pte_numa Node [MORON])
• Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v4r38
• Eval: mmtest utility
• GOAL: keep processes and their memory together on the same NUMA node, with support for automatically migrating pages to optimize NUMA locality.
• Eval 1): https://lkml.org/lkml/2012/3/20/508
• AutoNUMA benchmark ver 0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
• mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
Tools for NUMA Tuning
numactl
cgroups
taskset
lstopo
dmidecode
sysfs
irqbalance
numad
top
numatop
htop
tuna
irqstat
tuned-adm
Removal of existing bottlenecks
• Multi-queue block layer: http://kernel.dk/blk-mq.pdf

Improved tools
• numatop: https://01.org/numatop
• top: https://gitorious.org/procps/procps (top: added NUMA support)
• irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
• Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
• NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
Appendix: /proc/[number]/numa_maps (since Linux 2.6.14)

This file displays information about a process's NUMA memory policy and allocation. Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and on which nodes the pages have been allocated.

numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
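The per-node page counts in a numa_maps line can be pulled out with a small scanner. A sketch, not the UNAS code: it walks the N<node>=<pages> tokens of one line and reports the node holding the most pages of the mapping (node 1 for both the heap and stack lines shown on the implementation slide).

```c
#include <stdio.h>
#include <string.h>

/* Sketch: scan one /proc/<pid>/numa_maps line for N<node>=<pages>
 * tokens and return the node that holds the most pages of the
 * mapping (or -1 if no per-node counts are present). */
static int busiest_node(const char *line, long *pages_out)
{
    int  best_node  = -1;
    long best_pages = -1;
    for (const char *p = strchr(line, 'N'); p; p = strchr(p + 1, 'N')) {
        int  node;
        long pages;
        if (sscanf(p, "N%d=%ld", &node, &pages) == 2 && pages > best_pages) {
            best_pages = pages;
            best_node  = node;
        }
    }
    if (pages_out)
        *pages_out = best_pages;
    return best_node;
}
```

A monitor can use this per mapped range to decide whether a process's pages are concentrated on one node or scattered across several.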
Appendix: /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
• http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
• http://man7.org/linux/man-pages/man7/numa.7.html
Appendix: “numactl” Samples

numactl --physcpubind=+0-4,8-12 myapplic arguments
  Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
numactl --interleave=all bigdatabase arguments
  Run big database with its memory interleaved on all CPUs.
numactl --cpubind=0 --membind=0,1 process
  Run process on node 0 with memory allocated on nodes 0 and 1.
numactl --cpubind=0 --membind=0,1 -- process -l
  Run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
numactl --preferred=1 numactl --show
  Set preferred node 1 and show the resulting state.
numactl --interleave=all --shmkeyfile /tmp/shmkey
  Interleave all of the SysV shared memory region specified by /tmp/shmkey over all nodes.
numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
numactl --localalloc /dev/shm/file
  Reset the policy for the shared memory file file to the default localalloc policy.