Multithreading and Parallelism on iOS [MobOS 2013]

Multithreading and Parallelism on iOS

Kuba Brecka @kubabrecka
Mobile Operating Systems Conference MobOS 2013

Agenda

• Part I: Parallelism and multithreading overview

• Part II: Thread-safety, GCD, operation queues

• Part III: Synchronization, locking, memory model

• Part IV: Performance tuning, ILP

• Part V: (at the party) Whatever you’d like to discuss

Part I: Parallelism and multithreading overview

Quiz 1

int a;

- (void)method {
    a = 0;

    dispatch_queue_t queue = dispatch_get_global_queue(
        DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_async(queue, ^{ a = 1; });

    dispatch_async(queue, ^{ a = 2; });

    NSLog(@"%d", a);
}

Quiz 2

int a;

- (void)method {
    a = 0;
    dispatch_queue_t queue = dispatch_get_global_queue(
        DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_async(queue, ^{ a = 1; });
    while (a == 0) {
        // wait
    }

    NSLog(@"%d", a);
}

Parallelism is a huge topic

Terminology

• Parallel

• Multi-threaded

• Concurrent

• Simultaneous

• Asynchronous

Why parallelize?

• Responsiveness

• “when I scroll, it’s smooth”

• Performance

• “it works fast”

• Energy saving

• “it doesn’t drain my battery”

• Convenience

• some things are parallel by nature, e.g. running two completely separate apps

How?

• Multiple processes

• XPC, fork

• Multiple threads

• POSIX Threads, NSThread

• High-level thread abstraction

• Operation queues, dispatch queues

• GPGPU

• Instruction-level parallelism

• superscalar CPUs, pipelining, vector instructions

• Multiple PCs

• servers, clouds

Threads

• What is a thread?

• It’s an abstraction made by the OS

• The CPU has no such concept

• Represents a line of calculation

• Has an ID, a stack, thread-local storage, priority, CPU registers

• Shares memory and resources within a process

• The OS scheduler runs/pauses threads

• context switching

Issues with threading

• Race conditions

• the result depends on the timing of the scheduler

• the behavior is non-deterministic

• can result in almost anything

• crash, wrong result, corrupted data

• So, you have to use locks/mutexes/…

• More issues: deadlocks, livelocks, starvation

• Even the best guys have trouble with these

• Security consequences, vulnerabilities
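The race-condition bullets above can be sketched in a few lines. This is only an illustration in C with POSIX threads, not code from the talk: `run_counter_demo`, `locked_worker` and `ITERATIONS` are hypothetical names. Two threads increment one shared counter, and the mutex makes each read-modify-write exclusive so the total is exact.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000

static long counter;                       /* shared mutable state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter ITERATIONS times.
   The lock makes the read-modify-write of counter++ exclusive. */
static void *locked_worker(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two workers in parallel and returns the final counter value. */
long run_counter_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    int rc = pthread_create(&t1, NULL, locked_worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, locked_worker, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the lock/unlock pair the program has a data race: increments can be lost, and the total may come out lower and differ from run to run.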

Know your enemy

• The compiler

• The CPU

• The memory

• Time

• Your brain

The iPhone has matured

iPhone 4 512 MB RAM

A4 SoC (1 core) 800 MHz

iPhone 4S 512 MB RAM


iPhone 5 1 GB RAM


iPhone 5S 1 GB RAM


ARM has matured

• Apple A5 (2011)

• ARM Cortex-A9 MPCore

• 2 cores

• out-of-order execution

• speculative execution

• superscalar, pipelining (8 stages)

• NEON 128-bit SIMD

• Apple A7 (2013)

• ARMv8-A “Cyclone”

• 64-bit, 32 registers, per-core L1 cache

iOS has matured

• The kernel knows a lot more about the system than the developer

• GCD

• Operation Queues

• LLVM, compiler optimizations

• GPU computations

• Accelerate.framework

iOS threading technologies

• Multiple processes – forking disabled, no XPC

• Low-level threads

• POSIX Threads (pthread)

• NSThread

• -[NSObject performSelectorInBackground:withObject:]

• Higher-level abstractions

• NSOperationQueue, NSOperation

• GCD

Is multithreading hard?

• Yes, if you don’t know what you’re doing.

• But that’s true for anything.

• Paul E. McKenney: Is Parallel Programming Hard, And, If So, What Can You Do About It? (2013)

• https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html


You need to know how it works

• The abstractions you use (threads, dispatch queues) are leaky

• You still must know how it works below:

• CPU

• OS

• compiler (LLVM)

• libraries and 3rd party code you are using

• language specification

• language implementation

• + the abstraction you are using (GCD)

You need to know even more

• Often you parallelize to get better performance

• For this you need to know

• CPU architecture details

• CPU instruction latencies

• memory hierarchy and latencies

Parallelizing tasks vs. algorithms

• Task = a standalone unit of work

• has some inputs, gives some outputs

• “add a blur effect to these 1000 photos”

• 1 photo = 1 task (independent)

• “add a blur effect to this one 5000x5000px photo”

• 1 task = ?

• Some algorithms simply cannot be parallelized (you will not get any significant speedup)
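One possible answer to "1 task = ?" for a single large image is to cut it into independent row bands. The sketch below is an assumption-laden illustration in C with POSIX threads. It uses a per-pixel inversion instead of a blur, because a blur also reads neighbouring rows and would need an unmodified source buffer. `band_t`, `invert_band`, `invert_image_parallel` and `NUM_BANDS` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANDS 4

typedef struct {
    uint8_t *pixels;    /* grayscale image, row-major */
    size_t width;
    size_t row_begin;   /* first row of this band (inclusive) */
    size_t row_end;     /* last row of this band (exclusive) */
} band_t;

/* Inverts every pixel in one band; bands never overlap, so no locking. */
static void *invert_band(void *arg) {
    band_t *band = arg;
    for (size_t y = band->row_begin; y < band->row_end; y++)
        for (size_t x = 0; x < band->width; x++)
            band->pixels[y * band->width + x] ^= 0xFF;
    return NULL;
}

/* Splits the image into NUM_BANDS row ranges and processes them in parallel. */
void invert_image_parallel(uint8_t *pixels, size_t width, size_t height) {
    pthread_t threads[NUM_BANDS];
    band_t bands[NUM_BANDS];
    for (size_t i = 0; i < NUM_BANDS; i++) {
        bands[i] = (band_t){ pixels, width,
                             height * i / NUM_BANDS,
                             height * (i + 1) / NUM_BANDS };
        int rc = pthread_create(&threads[i], NULL, invert_band, &bands[i]);
        assert(rc == 0);
    }
    for (size_t i = 0; i < NUM_BANDS; i++)
        pthread_join(threads[i], NULL);
}
```

Each band has its own input range and writes only its own rows, so each one is an independent task.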

Part II: Thread-safety, GCD, operation queues

What’s thread safety?

• “Thread-safe object”

• you can safely use the object from multiple threads at the same time

• the internal state of the object will not get corrupted and it will behave correctly

• When you don’t know if an object is thread-safe, you have to assume it isn’t

• How do you make your object thread-safe?

• immutability, locks, atomic reads/writes

Shared mutable state

• Exclusive immutable object = no problem

• Shared immutable object = no problem

• Exclusive mutable object = no problem

• Shared mutable object

• root of all evil

• you always want to minimize this

Global variables

• “Global variables are bad”

• Multi-threading is another very good reason not to use global variables / global state

• Global variables are always shared

• Watch out for “hidden” global state:

• working directory, chdir()

• environment variables, putenv()

Thread-safety vs. iOS

• Terrible lack of proper documentation

• Most of the low-level Obj-C runtime is thread-safe

• memory management, ARC, weak references, …

• Immutable objects (NSString, NSArray, …) are thread-safe

• A few other classes are thread-safe

• Usually it’s thread-safe to call class methods

• google for “iOS thread safety”

• https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html


POSIX threads

• “plain threads”

• C API

• if you want to pass an object to the new thread, you will have issues with memory management

• Synchronization

• mutexes, conditions, R/W locks, barriers

POSIX thread API

• pthread_create

• pthread_join

• mutex

• pthread_mutex_init, pthread_mutex_lock, pthread_mutex_unlock

• conditions

• pthread_cond_init, pthread_cond_signal, pthread_cond_wait
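The calls listed above fit together roughly as follows. This is a minimal sketch, not a complete pattern for production code: one thread produces a value and signals a condition, the other sleeps until the value is ready. `producer`, `wait_for_result` and the variable names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static bool ready;      /* protected by result_lock */
static int result;      /* protected by result_lock */

/* Producer: stores a value, then signals the waiting thread. */
static void *producer(void *unused) {
    (void)unused;
    pthread_mutex_lock(&result_lock);
    result = 42;
    ready = true;
    pthread_cond_signal(&ready_cond);
    pthread_mutex_unlock(&result_lock);
    return NULL;
}

/* Consumer: sleeps on the condition until the producer sets ready. */
int wait_for_result(void) {
    pthread_t thread;
    ready = false;
    int rc = pthread_create(&thread, NULL, producer, NULL);
    assert(rc == 0);

    pthread_mutex_lock(&result_lock);
    while (!ready)      /* the loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cond, &result_lock);
    int value = result;
    pthread_mutex_unlock(&result_lock);

    pthread_join(thread, NULL);
    return value;
}
```

Unlike the busy loop in Quiz 2, the waiting thread here blocks inside `pthread_cond_wait` and the mutex guarantees that it sees the written value.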

NSThread

• “plain threads” as well

• Obj-C API

• mostly just a wrapper around POSIX threads

• memory management just works

• Synchronize with NSLock, NSCondition, …

NSThread API

• -[NSThread initWithTarget:selector:object:]

• -[NSThread start]

• +[NSThread detachNewThreadSelector:toTarget:withObject:]

• subclassing NSThread

• -[NSObject performSelectorInBackground:withObject:]

Thread-specific properties

• Thread-local storage

• Thread priorities

• Autorelease pools

• Detached vs. joinable

Grand Central Dispatch

• Let’s not think about threads

• Instead, let’s think about tasks

• New concepts:

• Tasks

• Queues

• Queue-specific data

• Dispatch groups

• Dispatch sources

• Synchronization

• Semaphores, barriers

• C API (!) but has ARC and works with blocks

GCD queues

• Main queue

• there is just one, executed on the main thread

• Concurrent queue

• tasks run concurrently

• 4 pre-made concurrent queues with different priorities

• DISPATCH_QUEUE_PRIORITY_DEFAULT, _HIGH, _LOW, _BACKGROUND

• you can make your own

• Serial queue

• only one task at a time, in order

• you can make your own

GCD task API

• Get/create a queue:

• dispatch_get_global_queue

• dispatch_get_main_queue

• dispatch_queue_create

• Submit task:

• dispatch_sync

• dispatch_async

• dispatch_apply

GCD convenience API

• dispatch_once

• guarantees the code runs only once

• use to implement a proper and fast singleton

• dispatch_after

• schedules the task to run on a queue at a specific time (no earlier, possibly later)
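The run-once guarantee behind the `dispatch_once` singleton can be illustrated without GCD. The sketch below swaps in `pthread_once`, the POSIX analogue, purely as an assumption for illustration; `config_t`, `shared_config_instance` and `run_once_demo` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct { int value; } config_t;

static config_t shared_config;
static int init_count;                         /* how many times init ran */
static pthread_once_t config_once = PTHREAD_ONCE_INIT;

/* Runs exactly once, no matter how many threads ask for the instance. */
static void init_config(void) {
    shared_config.value = 7;
    init_count++;
}

/* Accessor in the style of a singleton's sharedInstance method. */
config_t *shared_config_instance(void) {
    pthread_once(&config_once, init_config);
    return &shared_config;
}

static void *use_config(void *unused) {
    (void)unused;
    return shared_config_instance();
}

/* Calls the accessor from several threads; returns the init count. */
int run_once_demo(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&threads[i], NULL, use_config, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return init_count;
}
```

Every caller gets the same fully initialized object, and callers that arrive during initialization wait until it has finished.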

It’s not threads

• GCD uses threads, but the threads are completely managed by GCD

• You can’t assume your code will run on any specific thread

• even two tasks from the same serial queue can run on different threads

• Don’t use thread-local storage

• Don’t use thread priorities

Operation queues

• A similar abstraction to GCD, this time you have:

• NSOperation

• either a block, a method call or custom subclass

• concurrent or non-concurrent

• dependencies on other NSOperations

• support for cancellation

• NSOperationQueue

• executes the operations, or you can execute an operation directly

Operation queues API

• -[NSOperationQueue addOperation:]

• -[NSOperationQueue addOperationWithBlock:]

• -[NSOperation addDependency:]

• +[NSBlockOperation blockOperationWithBlock:]

• -[NSInvocationOperation initWithTarget:selector:object:]

Comparison

• POSIX threads, NSThread

• thread-based

• you have control over the lifetime of threads

• overhead when creating

• memory-management issues

• GCD, operation queues

• task-based

• nice API with objects/blocks

• operation queues

• dependencies

Run loops and messaging

• Avoid shared mutable state

• For POSIX threads and NSThreads:

• put your thread into an event loop, where it just waits until an event occurs

• the main thread has this by default

• hidden inside UIApplicationMain

• then you can communicate with the thread through:

• -[NSObject performSelector:onThread:withObject:waitUntilDone:]

Run loop API

• +[NSRunLoop currentRunLoop]

• -[NSRunLoop run]

• you have to add at least one input source or it will return immediately

• but you can add an empty port

• [NSMachPort port]

• -[NSRunLoop addPort:forMode:]

Main thread

• first thread = main thread = UI thread

• all rendering

• all layout

• scrolling, panning, zooming

• user input (touches, on-screen keyboard, external keyboard)

• system events

• Yes, that’s a lot of work.

• 60 FPS = 16 ms per frame

• Yes, that’s very little time.

Offload the main thread

• Goal: Keep the UI thread responsive

• Rule:

• Do as much work as possible on other threads

• Well, but…

• Do as little work as possible in the background, that is just enough to keep the main thread responsive

• Measure, measure, measure

Rendering and animations

• Your app doesn’t have access to the GPU/display

• Background process called “backboardd”

• IPC – rendering commands

• Shared memory – backing stores

• CAAnimations are transferred to backboardd and performed without any communication with your app

Demo 1 https://github.com/kubabrecka/mobos-ios


Part III: Synchronization, locking, memory model



Only trust what’s guaranteed

• The order of things isn’t guaranteed unless someone tells you:

int a, b; // global variables

// thread 1
b = 20;
a = 10;

// thread 2
wait for a to be 10
NSLog(@"%d", b); // ?

Solutions

• Avoid shared mutable state

• communicate by message passing

• design your objects as immutable

• avoid multithreading

• Synchronization

• You must always have “a plan”

• if you can’t tell which code is supposed to run in which thread, then nobody can help you

• if you can’t tell which data can be accessed from which thread, then nobody can help you

So what is guaranteed?

• Semantics for one thread

• “the (single-threaded) code you wrote will have the correct result”

• For multi-threaded code, you have to obtain guarantees by using:

• Atomic data types, volatile keyword

• Locks, semaphores, memory barriers

• For 3rd party code, generally you can’t assume anything

Atomic types

• Which data types are atomic?

• Depends on the architecture!

• Pointers and “native” integers are usually atomic

• What does an atomic data type guarantee?

• Also depends on the architecture!

• A single read or a single write is usually atomic

• Definitely not “i++”

• OSAtomicIncrement, …
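To make the "i++ is not atomic" point concrete, here is a small sketch that uses C11 `atomic_fetch_add` as a portable stand-in for the OSAtomic family named above. That substitution is an assumption for illustration, and `count_hits`, `run_atomic_demo` and `INCREMENTS` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS 100000

static atomic_int hits;

/* atomic_fetch_add performs the read, add and write as one indivisible step. */
static void *count_hits(void *unused) {
    (void)unused;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&hits, 1);
    return NULL;
}

/* Two threads increment the same counter without any lock. */
int run_atomic_demo(void) {
    pthread_t t1, t2;
    atomic_store(&hits, 0);
    int rc = pthread_create(&t1, NULL, count_hits, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, count_hits, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&hits);
}
```

With a plain `int` and `hits++`, the two threads could read the same old value and one of the increments would be lost.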

Objective-C atomic properties

• @property (atomic) int a;

• Only affects auto-generated getters and setters

• Again, a single read is atomic, a single write is atomic

• Again, “obj.a++” is not atomic

• It has no effect on direct member access, obj->a

• “atomic” is the default

Objective-C messaging

• Is the order of Obj-C method calls guaranteed?

• It seems so, the current compilers don’t optimize through the dynamic dispatch (objc_msgSend)

• But it’s still not guaranteed

• This might (and probably will) change in the future

Volatile keyword

• don’t confuse with Java volatile

• prevents some compiler optimizations

• the variable can change on its own

• doesn’t give you atomicity

• doesn’t give you ordering

• there are better means of synchronization

Locks

• Mutexes, critical sections

• allow only a single thread to be in this part of code at the same time

• -[NSLock lock]

• -[NSLock unlock]

• @synchronized (obj) { … }

• uses an implicit lock, which exists on each object

• handles exceptions

• Recursive locks, R/W locks, conditions

Lock-free algorithms and data structures

• Some concurrent structures (hash tables, queues) can be written without using explicit locks

• Currently a major topic in CS

• databases

• The name is confusing though, there is still a lot of locking happening

• cache coherency

• memory bus locking for complex atomic operations

Memory barriers

• Locks can be expensive

• A memory barrier ensures ordering without locking

• Memory reads and writes issued before the barrier cannot be reordered with those issued after it

• But the guarantee is only at the point of the barrier!

• OSMemoryBarrier
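The two-thread ordering example from "Only trust what’s guaranteed" can be repaired with barriers. The sketch below uses C11 `atomic_thread_fence` as a portable stand-in for `OSMemoryBarrier`. That choice is an assumption for illustration, and `payload`, `flag`, `writer` and `read_payload` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;          /* plain data, published via the flag */
static atomic_int flag;      /* 0 = not ready, 1 = ready */

/* Writer: store the data, fence, then raise the flag. */
static void *writer(void *unused) {
    (void)unused;
    payload = 20;
    atomic_thread_fence(memory_order_release);   /* payload write stays before the flag write */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

/* Reader: wait for the flag, fence, then read the data. */
int read_payload(void) {
    pthread_t thread;
    atomic_store(&flag, 0);
    int rc = pthread_create(&thread, NULL, writer, NULL);
    assert(rc == 0);

    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;   /* spin until the writer publishes */
    atomic_thread_fence(memory_order_acquire);   /* payload read stays after the flag read */
    int seen = payload;

    pthread_join(thread, NULL);
    return seen;
}
```

Both fences are needed: the writer's fence orders its two stores, and the reader's fence orders its two loads. Dropping either one brings back the "?" from the original example.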

Is the trouble worth it?

• Measure!

• OK, so you need more than a single thread

• use task-level parallelization (GCD) with clear input and output, use immutable data and message passing

• Measure again!

• OK, so you need more than that

• find the bottleneck, don’t assume

• is it really the CPU? Isn’t the bottleneck in the memory/network/disk?



Part IV: Performance tuning, ILP

Multithreading isn’t everything

• There are plenty of ways to make your code run faster

• avoiding unnecessary work

• choosing better algorithms

• calculations on the GPU

• using vector instructions (AVX, SSE, NEON)

• hand-optimizing your assembly

• tweaking the compiler optimizations

The bottleneck

• It’s easy to make wrong assumptions

• Your bottleneck can be

• CPU

• Memory

• I/O (disk, network)

• GPU

• There is no “usually”

Some common UI issues

• Creating UIViews is slow

• reuse views, dequeue cells in tables

• Loading images is slow

• cache images

• Rendering is slow

• avoid drawRect, consider rasterization of flattened views

• Scrolling is slow

• don’t do heavy work in scrollViewDidScroll

• Rendering shadows is slow

• use shadowPath

• Rendering layer masks is slow

• pre-render

Choose your data structures

• -[NSArray containsObject:]

• O(n)

• -[NSSet containsObject:]

• O(1) on average

Always profile first

• Don’t guess, measure!

• Amdahl’s law

• Hardware is cheap, programmers are expensive
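Amdahl’s law puts a number on how much parallelizing can help: if a fraction p of the work can be spread over n cores, the speedup is 1 / ((1 - p) + p / n). A tiny helper, with the hypothetical name `amdahl_speedup`, makes it easy to try values:

```c
#include <assert.h>

/* Amdahl's law: speedup on n cores when a fraction p of the work
   is parallelizable and the remaining (1 - p) stays serial. */
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With 90 % of the work parallelizable, two cores give about 1.82x rather than 2x. With only 50 % parallelizable, no number of cores reaches 2x.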

Profiling with Instruments

• What can you measure with Instruments?

• CPU

• utilization

• all performance counters (interrupts, syscalls, user/kernel time, …)

• Memory

• free memory

• allocations, leaks, “zombies”

• many more performance counters (page faults, cache hits/misses, …)

• Network

• Battery usage

• Display FPS

• Single process / multiple processes

• …

Measure carefully

• Instruments isn’t perfect

• Sampling is only a statistical method

• Real devices behave very differently from the simulator

• Hardware is different

• Compiled code is different (both yours and libraries)

• Verify your assumptions

• In many cases, wrapping your code with two calls to [NSDate date] and subtracting is the best approach
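The "two timestamps and a subtraction" approach looks roughly like this in plain C. The sketch assumes a POSIX monotonic clock rather than `[NSDate date]`, because a monotonic clock is unaffected by wall-clock adjustments. `now_seconds`, `time_call` and `sample_work` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Returns the current monotonic time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Times one call of work() and returns the elapsed seconds. */
double time_call(void (*work)(void)) {
    double start = now_seconds();
    work();
    return now_seconds() - start;
}

/* Placeholder workload; volatile keeps the loop from being optimized away. */
static volatile unsigned long sink;
void sample_work(void) {
    for (unsigned long i = 0; i < 1000000; i++)
        sink += i;
}
```

Calling `time_call(sample_work)` several times and comparing the results gives a feel for how noisy a single measurement is.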

Optimize memory/cache accesses

• Cache lines (64 B)

• Try to linearize memory accesses

• Choose correct data structures

• array of structs vs. struct of arrays

• Aligned memory accesses
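The "array of structs vs. struct of arrays" choice is easiest to see side by side. In this sketch all type and function names are hypothetical. Both functions compute the same sum, but they walk memory very differently.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 1024

/* Array of structs: x, y, z of one particle are adjacent in memory. */
typedef struct { float x, y, z; } particle_t;

/* Struct of arrays: all x values are adjacent, then all y, then all z. */
typedef struct { float x[COUNT], y[COUNT], z[COUNT]; } particles_t;

/* Summing x touches every third float, so most of each cache line goes unused. */
float sum_x_aos(const particle_t *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Summing x walks one contiguous array, so every loaded byte is used. */
float sum_x_soa(const particles_t *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

Which layout wins depends on the access pattern: code that always needs x, y and z of the same element together is better served by the array of structs.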

Instruction-level parallelism

• The compiler tries to maximize ILP with scheduling

• The main obstacle is data dependency

• a series of arithmetic operations which depend on each other simply cannot be parallelized

• independent operations are easily parallelized

• CPU is superscalar and has deep pipelines

• the problem is that often the compiler can’t be sure about the dependency

• memory accesses, aliasing

• it has to assume the dependency is there

Help the compiler

• The compiler is smart:

• GCC: dead code elimination, common subexpression elimination, forward propagation, loop unrolling, tail call elimination, loop invariant motion, lower complex arithmetic, vectorization, modulo scheduling, …

• Sometimes, it would like to be smart, but it can’t:

• the C “restrict” keyword (C99):

void * memcpy(void * restrict s1, const void * restrict s2, size_t n);
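The same keyword can be used in your own code. In this sketch `scale_floats` is a hypothetical function. Because of `restrict`, the compiler may assume that `dst` and `src` never overlap, so it is free to reorder the loads and stores and to vectorize the loop.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises that dst and src never overlap, so the compiler may
   load several src elements before storing to dst. */
void scale_floats(float *restrict dst, const float *restrict src,
                  float factor, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```

The promise is the caller's responsibility: passing overlapping buffers to a function declared like this is undefined behavior.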

Vector instructions

• SIMD = Single Instruction Multiple Data

• ARM NEON

• 128-bit instructions (e.g. 4x 32-bit or 16x 8-bit at once)

• LLVM auto-vectorizer

• Often you have to change your data structure

• alignment

• interleaved values

Accelerate.framework

• Heavily optimized built-in framework for:

• image processing

• image format conversion and encoding/decoding

• DSP, FFT

• various general math on “large” data

#include <Accelerate/Accelerate.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);

Away from the CPU

• GPGPU

• Only through OpenGL ES shaders

• Perfect for image processing (Core Image, GPUImage)

• M7 motion coprocessor (iPhone 5S)

Thank you for your attention.
