Utilization, Simultaneous Multi-threading & Virtual Processors

AIX CPU & Virtual Processors

Steve [email protected]

© 2012 IBM Corporation



Review: POWER6 vs POWER7 SMT Utilization

■ When a single-threaded process is simulated on 1 core with 1 Virtual Processor, the reported utilization values differ by architecture and SMT mode. In each of these cases, physical consumption can be reported as 1.0.

■ Real world production workloads will involve dozens to thousands of threads, so many users may not notice any difference in the “macro” scale

■ See:

– Simultaneous Multi-Threading on POWER7 Processors by Mark Funk: http://www.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf

– Processor Utilization in AIX by Saravanan Devendran: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Power%20Systems/page/Understanding%20CPU%20utilization%20on%20AIX

[Figure: reported utilization per hardware thread context (Htc). POWER6 SMT2 with Htc0 busy and Htc1 idle: 100% busy. POWER7 SMT2 with Htc0 busy and Htc1 idle: ~70% busy. POWER7 SMT4 with Htc0 busy and Htc1-Htc3 idle: ~65% busy. Two further SMT2 panels with both Htc0 and Htc1 busy: 100% busy.]

“busy” = user% + system%

POWER5/6 utilization does not account for SMT, POWER7 is calibrated in hardware
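The single-thread figures on this slide can be captured in a small lookup. This is a minimal Python sketch, not an AIX interface: the table and the helper name are hypothetical, and the percentages are the approximate values quoted on the slide rather than measured data.

```python
# Approximate core "busy" (user% + system%) reported when only ONE hardware
# thread of a core is running work, as quoted on the slide.
SINGLE_THREAD_BUSY = {
    ("POWER6", 2): 100,  # POWER5/6 utilization does not account for SMT
    ("POWER7", 2): 70,   # approximate
    ("POWER7", 4): 65,   # approximate
}


def reported_busy(arch, smt_mode, busy_threads):
    """Return the approximate busy% for one core (hypothetical helper)."""
    if not 0 <= busy_threads <= smt_mode:
        raise ValueError("busy_threads must be between 0 and smt_mode")
    if busy_threads == 0:
        return 0
    if busy_threads == smt_mode:
        return 100  # every hardware thread is busy
    if busy_threads == 1:
        return SINGLE_THREAD_BUSY[(arch, smt_mode)]
    # The slide gives no figure for partially loaded SMT4 cores.
    raise ValueError("no figure available for this combination")


if __name__ == "__main__":
    for arch, smt in sorted(SINGLE_THREAD_BUSY):
        print(arch, "SMT%d" % smt, reported_busy(arch, smt, 1))
```

The same single-threaded job therefore shows up as a fully busy core on POWER6 but as a partly idle core on POWER7, even though physical consumption is 1.0 in both cases.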


Review: POWER6 vs POWER7 Dispatch

■ There is a difference between how workloads are distributed across cores in POWER7 and earlier architectures

– In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded

– In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Once the secondary threads are loaded, only then will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.

– Why? Raw Throughput provides the highest per-thread throughput and best response times at the expense of activating more physical cores

[Figure: point at which another Virtual Processor is activated. POWER6 SMT2: Htc0 and Htc1 busy, ~80% busy. POWER7 SMT4: Htc0 busy, Htc1-Htc3 idle, ~50% busy.]


Review: POWER6 vs POWER7 Dispatch

Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number

Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be in POWER7

[Figure: thread dispatch order across proc0, proc1, proc2 and proc3. POWER7: Primary, then Secondary, then Tertiaries. POWER6: Primary and Secondary.]


POWER7 Consumption: A Problem?

■ POWER7 will activate more cores at lower utilization levels than earlier architectures when excess VPs are present

– Customers may complain that the physical consumption metric (reported as physc or pc) is equal to or possibly even higher after migrations to POWER7 from earlier architectures

– Expect every POWER7 customer with this complaint to also have significantly higher idle% values than on earlier architectures

– Consolidation of workloads may result in many more VPs assigned to the POWER7 partition

■ Customers may also note that CPU capacity planning is more difficult in POWER7. If they will not reduce VPs, they may need to subtract %idle from the physical consumption metrics for more accurate planning.

– In POWER5 & POWER6, 80% utilization was closer to 1.0 physical core

– In POWER7 with excess VPs, in theory, all of the VPs could be dispatched and the system could be 40-50% idle

– Thus, you cannot get to higher utilization of larger systems if you have lots of VPs which are only exercising the primary SMT thread: you will have high Physical Consumption with lots of idle capacity
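As a worked example of subtracting %idle from physical consumption, here is a minimal Python sketch. It assumes the adjustment means scaling physc by the non-idle fraction; the function name and the sample numbers are hypothetical.

```python
def effective_cores(physc, idle_pct):
    """Physical consumption with the idle share removed (planning estimate).

    Assumption: "subtract %idle" is read as physc * (1 - idle% / 100).
    """
    if not 0.0 <= idle_pct <= 100.0:
        raise ValueError("idle_pct must be between 0 and 100")
    return physc * (1.0 - idle_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical partition: 8 VPs dispatched, mostly on primary threads.
    print(round(effective_cores(8.0, 45.0), 2))  # 4.4
```

In this made-up case the partition reports a physical consumption of 8.0 while doing roughly 4.4 cores' worth of work, which is the gap the slide warns about.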


POWER7 Consumption: Solutions

■ Apply the APARs in the backup section; the defects they fix can be the cause of many of the high consumption complaints

■ Beware allocating many more Virtual Processors than sized

■ Reduce Virtual Processor counts to activate secondary and tertiary SMT threads

– Utilization percentages will go up, physical consumption will remain equal or drop

– Use nmon, topas, sar or mpstat to look at logical CPUs. If only primary SMT threads are in use with a multi-threaded workload, then excess VPs are present.

■ A new alternative is Scaled Throughput, allowing increases in per-core utilization by a Virtual Processor
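The logical CPU check can be sketched in Python once per-logical-CPU busy figures have been collected from one of those tools. This is an illustrative helper only: the function name, the 5% noise threshold, and the assumption that logical CPU N is a primary thread when N is a multiple of the SMT mode are all assumptions, not documented AIX behavior.

```python
def only_primary_threads_busy(lcpu_busy, smt=4, threshold=5.0):
    """Return True if every busy logical CPU is a primary SMT thread.

    lcpu_busy: busy% (user + sys) per logical CPU, indexed by lcpu number.
    Assumes lcpu N is a primary thread when N % smt == 0.
    """
    busy = [i for i, pct in enumerate(lcpu_busy) if pct > threshold]
    # An idle partition (no busy CPUs) is not evidence of excess VPs.
    return bool(busy) and all(i % smt == 0 for i in busy)


if __name__ == "__main__":
    # Hypothetical sample: 4 VPs in SMT4, work only on lcpu 0, 4, 8 and 12.
    sample = [62, 0, 0, 0, 58, 1, 0, 0, 60, 0, 0, 0, 55, 0, 2, 0]
    print(only_primary_threads_busy(sample))
```

A True result for a multi-threaded workload suggests the partition has more VPs than it needs.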


Scaled Throughput


What is Scaled Throughput?

■ Scaled Throughput is an alternative to the default “Raw” AIX scheduling mechanism

– It is an alternative for some customers at the cost of partition performance

– It is not an alternative to addressing AIX and pHyp defects, partition placement issues, realistic entitlement settings and excessive Virtual Processor assignments

– It will dispatch more SMT threads to a VP/core before unfolding additional VPs

– It can be considered to be more like the POWER6 folding mechanism, but this is a generalization, not a technical statement

– Supported on POWER7/POWER7+, AIX 6.1 TL08 & AIX 7.1 TL02

■ Raw vs Scaled Performance

– Raw provides the highest per-thread throughput and best response times at the expense of activating more physical cores

– Scaled provides the highest core throughput at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP because tertiary thread capacity is “not left on the table.”


POWER7 Raw vs Scaled Throughput

Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number

[Figure: logical CPUs lcpu0-3 on proc0, lcpu4-7 on proc1, lcpu8-11 on proc2 and lcpu12-15 on proc3, with Primary, Secondary and Tertiaries threads marked for Raw (default), Scaled Mode 2 and Scaled Mode 4. Utilization labels shown for each of the four cores: 63%, 77%, 88%, 100%.]


Scaled Throughput: Tuning

■ Tunings are not restricted, but anyone experimenting with this without understanding the mechanism may suffer significant performance impacts

– Dynamic schedo tunable

– Actual thresholds used by these modes are not documented and may change at any time

■ schedo -p -o vpm_throughput_mode=

0 Legacy Raw mode (default)

1 Scaled or “Enhanced Raw” mode with a higher threshold than legacy

2 Scaled mode, use primary and secondary SMT threads

4 Scaled mode, use all four SMT threads

■ Tunable schedo vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled Mode

– Allows fine-tuning for workloads depending on utilization level

– VPs will “ramp up” quicker to a desired number of cores, and then be more conservative under the chosen Scaled mode
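The mode values can be kept in a small table so that a script rejects anything other than 0, 1, 2 or 4 before building the command. This is a minimal Python sketch; the helper name and the `persistent` parameter are hypothetical, and the function only builds the command string, it does not run schedo.

```python
# Mode values and descriptions as listed on the slide.
VPM_THROUGHPUT_MODES = {
    0: "Legacy Raw mode (default)",
    1: 'Scaled or "Enhanced Raw" mode with a higher threshold than legacy',
    2: "Scaled mode, use primary and secondary SMT threads",
    4: "Scaled mode, use all four SMT threads",
}


def schedo_command(mode, persistent=True):
    """Build (not execute) the schedo command line for a throughput mode."""
    if mode not in VPM_THROUGHPUT_MODES:
        raise ValueError("vpm_throughput_mode must be one of 0, 1, 2, 4")
    # -p applies the value now and across reboots; -o alone is current only.
    flags = "-p -o" if persistent else "-o"
    return "schedo %s vpm_throughput_mode=%d" % (flags, mode)


if __name__ == "__main__":
    for mode, text in VPM_THROUGHPUT_MODES.items():
        print(schedo_command(mode), "#", text)
```

Keeping the command as a string makes it easy to review or log before anyone runs it on a production partition.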


Scaled Throughput: Workloads

■ Workloads

– Workloads with many light-weight threads with short dispatch cycles and low IO (the same types of workloads that benefit from SMT)

– Customers who are easily meeting network and I/O SLAs may find the tradeoff between higher latencies and lower core consumption attractive

– Customers who will not reduce over-allocated VPs and prefer to see behavior similar to POWER6

■ Performance

– It depends, we can’t guarantee what a particular workload will do

– Mode 1 may see little or no impact but higher per-core utilization with lower physical consumption

– Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times)


Raw Throughput: Default and Mode 1

[Chart: Raw Throughput. Axis ticks 0 to 12; axis label: Time; series: Active_Threads, Active_VP, Phys_Busy, Phys_Consumed.]

■ AIX will typically allocate 2 extra Virtual Processors as the workload scales up and is more instantaneous in nature

■ VP’s are activated and deactivated one second at a time

[Chart: Scaled Throughput: Mode 1. Axis ticks 0 to 12; axis label: Time.]


■ Mode 1 is more of a modification to the Raw (Mode 0) throughput mode, using a higher utilization threshold and a moving average to reduce VP oscillation

■ It is less aggressive about VP activations. Many workloads may see little or no performance impact


Scaled Throughput: Modes 2 & 4


[Chart: axis ticks 0 to 12; axis label: Time.]


■ Mode 2 utilizes both the primary and secondary SMT threads

■ Somewhat like POWER6 SMT2, eight threads are collapsed onto four cores

■ “Physical Busy” or utilization percentage reaches ~80% of Physical Consumption


[Chart: axis ticks 0 to 12; axis label: Time.]


■ Mode 4 utilizes the primary, secondary and tertiary SMT threads

■ Eight threads are collapsed onto two cores

■ “Physical Busy” or utilization percentage reaches 90-100% of Physical Consumption
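The "eight threads onto four cores" and "eight threads onto two cores" figures follow from how many SMT threads each mode fills per core. This is a deliberately simplified Python sketch: it ignores the undocumented utilization thresholds, leaves out Mode 1 for that reason, and treats one VP as one core. The function and table names are hypothetical.

```python
import math

# SMT threads loaded per core before another VP is unfolded (simplified).
THREADS_PER_CORE = {
    0: 1,  # Raw: primary threads first
    2: 2,  # Scaled Mode 2: primary and secondary
    4: 4,  # Scaled Mode 4: all four SMT threads
}


def cores_activated(threads, mode, vps):
    """Estimate how many cores a number of runnable threads will occupy."""
    per_core = THREADS_PER_CORE[mode]
    return min(vps, math.ceil(threads / per_core))


if __name__ == "__main__":
    for mode in (0, 2, 4):
        print("mode", mode, "->", cores_activated(8, mode, vps=8), "cores")
```

With 8 runnable threads and 8 VPs available, the sketch yields 8 cores in Raw mode, 4 in Mode 2 and 2 in Mode 4, matching the bullets above.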


Tuning (other)

■ Never adjust the legacy vpm_fold_threshold without L3 Support guidance

■ Remember that Virtual Processors activate and deactivate on 1 second boundaries. The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than required by the workload. This is rarely needed, and is overridden when Scaled Mode is active.

■ If you use RSET or bindprocessor function and bind a workload

– To a secondary thread, that VP will always stay in at least SMT2 mode

– To a tertiary thread, that VP cannot leave SMT4 mode

– These functions should only be used to bind to primary threads unless you know what you are doing or are an application developer familiar with the RSET API

– Use bindprocessor -s to list primary, secondary and tertiary threads

■ A recurring question is “How do I know how many Virtual Processors are active?”

– There is no tool or metric that shows active Virtual Processor count

– There are ways to estimate this: looking at physical consumption (if folding is activated), the physc count should roughly equal active VPs

– nmon Analyser makes a somewhat accurate representation, but over long intervals (with a default of 5 minutes), it does not provide much resolution

– For an idea at a given instant with a consistent workload, you can use: echo vpm | kdb


Virtual Processors

> echo vpm | kdb

VSD Thread State
CPU  CPPR  VP_STATE  FLAGS  SLEEP_STATE  PROD_TIME: SECS   NSECS     CEDE_LAT
0    0     ACTIVE    1      AWAKE        0000000000000000  00000000  00
1    255   ACTIVE    0      AWAKE        000000005058C6DE  25AA4BBD  00
2    255   ACTIVE    0      AWAKE        000000005058C6DE  25AA636E  00
3    255   ACTIVE    0      AWAKE        000000005058C6DE  25AA4BFE  00
4    255   ACTIVE    0      AWAKE        00000000506900DD  0D0CC64B  00
5    255   ACTIVE    0      AWAKE        00000000506900DD  0D0D6EE0  00
6    255   ACTIVE    0      AWAKE        00000000506900DD  0D0E4F1E  00
7    255   ACTIVE    0      AWAKE        00000000506900DD  0D0F7BE6  00
8    11    DISABLED  1      SLEEPING     0000000050691728  358C3218  02
9    11    DISABLED  1      SLEEPING     0000000050691728  358C325A  02
10   11    DISABLED  1      SLEEPING     0000000050691728  358C319F  02
11   11    DISABLED  1      SLEEPING     0000000050691728  358E2AFE  02
12   11    DISABLED  1      SLEEPING     0000000050691728  358C327A  02
13   11    DISABLED  1      SLEEPING     0000000050691728  358C3954  02
14   11    DISABLED  1      SLEEPING     0000000050691728  358C3B13  02
15   11    DISABLED  1      SLEEPING     0000000050691728  358C3ABD  02

[Figure annotation: two groups of four logical CPUs are each marked "VP".]

With SMT4, each core will have 4 Logical CPUs, which equals 1 Virtual Processor

This method is only useful for steady-state workloads
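Counting active Virtual Processors from this output can be automated. The Python sketch below parses text captured from `echo vpm | kdb` and groups logical CPUs into VPs of four. It assumes SMT4, the column order shown on this slide, and that a VP counts as active if any of its logical CPUs is in the ACTIVE state; the function name is hypothetical and the sample is abbreviated from the slide.

```python
# Abbreviated sample in the column order shown on the slide.
SAMPLE = """\
CPU CPPR VP_STATE FLAGS SLEEP_STATE SECS NSECS CEDE_LAT
0 0 ACTIVE 1 AWAKE 0000000000000000 00000000 00
1 255 ACTIVE 0 AWAKE 000000005058C6DE 25AA4BBD 00
2 255 ACTIVE 0 AWAKE 000000005058C6DE 25AA636E 00
3 255 ACTIVE 0 AWAKE 000000005058C6DE 25AA4BFE 00
4 255 ACTIVE 0 AWAKE 00000000506900DD 0D0CC64B 00
5 255 ACTIVE 0 AWAKE 00000000506900DD 0D0D6EE0 00
6 255 ACTIVE 0 AWAKE 00000000506900DD 0D0E4F1E 00
7 255 ACTIVE 0 AWAKE 00000000506900DD 0D0F7BE6 00
8 11 DISABLED 1 SLEEPING 0000000050691728 358C3218 02
9 11 DISABLED 1 SLEEPING 0000000050691728 358C325A 02
10 11 DISABLED 1 SLEEPING 0000000050691728 358C319F 02
11 11 DISABLED 1 SLEEPING 0000000050691728 358E2AFE 02
"""


def count_active_vps(kdb_output, smt=4):
    """Count VPs that have at least one logical CPU in the ACTIVE state."""
    active_vps = set()
    for line in kdb_output.splitlines():
        fields = line.split()
        # Data rows start with the logical CPU number; skip headers/banners.
        if len(fields) >= 3 and fields[0].isdigit():
            if fields[2] == "ACTIVE":
                active_vps.add(int(fields[0]) // smt)
    return len(active_vps)


if __name__ == "__main__":
    print(count_active_vps(SAMPLE))  # 2
```

As the slide notes, a single snapshot like this is only meaningful for steady-state workloads, since VPs fold and unfold on one-second boundaries.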


Variable Capacity Weighting


Variable Capacity Weighting - Reality

■ Do I use a partition’s Variable Capacity Weight to get uncapped capacity?

– NO: PowerVM has no mechanism to distribute uncapped capacity when the pool is not constrained

– YES: It is a mechanism to arbitrate shared-pool contention

http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphat/iphatsharedproc.htm

■ Uncapped weight is only used where there are more virtual processors ready to consume unused resources than there are physical processors in the shared processor pool. If no contention exists for processor resources, the virtual processors are immediately distributed across the logical partitions independent of their uncapped weights. This can result in situations where the uncapped weights of the logical partitions do not exactly reflect the amount of unused capacity.

■ This behavior is only supported for the default (global) shared pool

– Unused capacity is spread by the hypervisor across all active pools, and not managed on a per-pool basis

– Cycles will be distributed evenly in virtual shared (sub) pools, and not based on Variable Capacity Weight ratios


Constraining Workloads

■ Variable Capacity Weighting is not a partition or job management utility

– Environments that are out of pool resources cannot be actively controlled with high-fidelity by fighting over scraps of unused cycles

– It is only a failsafe when an environment is running out of capacity, typically to provide a last option for the VIOS

– If you are out of CPU resources, you need to use other methods to manage workloads


Constraining Workloads

■ How do I protect critical workloads? By constraining other partitions’ access to CPU resources

– PowerVM methods

• Capping Entitlement or setting Variable Capacity Weight to 0

• Dynamically reducing Virtual Processors. Entitlement can also be reduced dynamically, and can reduce guaranteed resources, but adjusting VPs has a more direct impact.

• You must establish a practical range of minimum and maximum entitlement to allow flexibility in dynamic changes

• Virtual Shared Pools (or sub-pools) can constrain a workload by setting Maximum Processing Units of a sub-pool dynamically, but this is effectively the same as reducing VPs

• Use Capacity-on-Demand feature(s) to grow pool

– Operating System methods

• Process priorities

• Execute flexible workloads at different times

• Workload Manager Classes or Workload Partitions

• Rebalance workloads between systems


Constraining Workloads

If you’re out of capacity, you need to dial back VPs, cap partitions, move workloads (in time or placement) or use Capacity-on-Demand


Redbook & APAR Updates


Power Systems Performance Guide

http://www.redbooks.ibm.com/abstracts/sg248080.html

This is an outstanding Redbook for new and experienced users


POWER7 Optimization & Tuning Guide

http://www.redbooks.ibm.com/abstracts/sg248079.html

A single “first stop” definitive source for a wide variety of general information and guidance, referencing other more detailed sources on particular topics

Exploitable by IBM, ISV and customer software developers

Hypervisor, OS (AIX & Linux), Java, compilers and memory details

Guidance/Links for DB2, WAS, Oracle, Sybase, SaS, SAP Business Objects


Performance APARs

■ The most problematic performance issues with AIX were resolved in early 2012. Surprisingly, many customers are still running with these defects

– Memory Affinity Domain Balancing

– Scheduler/Dispatch defects

– Wait process defect

– TCP Retransmit

– Shared Ethernet defects

■ Do not run with a firmware level below 720_101. A hypervisor dispatch defect exists below that level.

■ The next slide provides the APARs to resolve the major issues

– We strongly recommend updating to these levels if you encounter performance issues. AIX Support will likely push you to these levels before wanting to do detailed research on performance PMRs.

– All customer Proof-of-Concept or tests should use these as minimum recommended levels to start with


Performance APARs – MUST HAVE

Issue                                        Release   APAR      SP/PTF
WAITPROC IDLE LOOPING CONSUMES CPU           7.1 TL1   IV10484   SP2 (IV09868)
                                             6.1 TL7   IV10172   SP2 (IV09929)
                                             6.1 TL6   IV06197   U846391 bos.mp64 6.1.6.17 or SP7
                                             6.1 TL5   IV01111   U842590 bos.mp64 6.1.5.9 or SP8

SRAD load balancing issues on shared LPARs   7.1 TL1   IV10802
                                             6.1 TL7   IV10173
                                             6.1 TL6   IV06196
                                             6.1 TL5   IV06194

Miscellaneous dispatcher/scheduling          7.1 TL1   IV10803
performance fixes                            6.1 TL7   IV10292
                                             6.1 TL6   IV10259
                                             6.1 TL5   IV11068

address space lock contention issue          7.1 TL1   IV10791   SP2 (IV09868)
                                             6.1 TL7   IV10606   SP2 (IV09929)
                                             6.1 TL6   IV03903   U846391 bos.mp64 6.1.6.17 or SP7
                                             6.1 TL5   n/a

TCP Retransmit Processing is slow (HIPER)    7.1 TL1   IV13121   SP4
                                             6.1 TL7   IV14297   SP4
                                             6.1 TL6   IV18483   U849886 bos.net.tcp.client 6.1.6.19 or SP8

SEA lock contention and driver issues        2.2.1.4             FP25 SP02


Early 2013 Paging Defect

■ The new numperm_global tunable has been enabled with AIX 6.1 TL7 SP4 / 7.1 TL1 SP4. Customers may experience early paging due to a failed pincheck on 64K pages

■ What

– Fails to steal from 4K pages when 64K pages near maximum pin percentage (maxpin) and 4K pages are available

– Scenario not properly checked for all memory pools when numperm_global is enabled

– vmstat -v shows that the number of 64K pages pinned is close to maxpin%

– svmon shows that 64K pinned pages are approaching the maxpin value

■ Action

– Apply APAR

– Alternatively, if the APAR cannot be applied immediately, disable numperm_global: # vmo -p -o numperm_global=0

– Tunable is dynamic, but workloads paged out will have to be paged in and performance may suffer until that completes or a reboot is performed

■ APARs

IV26272 AIX 6.1 TL7

IV26735 AIX 6.1 TL8

IV26581 AIX 7.1 TL0

IV27014 AIX 7.1 TL1

IV26731 AIX 7.1 TL2
