Multithreading and Parallelism on iOS Kuba Brecka @kubabrecka Mobile Operating Systems Conference MobOS 2013
Agenda
• Part I: Parallelism and multithreading overview
• Part II: Thread-safety, GCD, operation queues
• Part III: Synchronization, locking, memory model
• Part IV: Performance tuning, ILP
• Part V: (at the party) Whatever you’d like to discuss
Part I: Parallelism and multithreading overview
Quiz 1
    int a;

    - (void)method {
        a = 0;

        dispatch_queue_t queue = dispatch_get_global_queue(
            DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_async(queue, ^{ a = 1; });

        dispatch_async(queue, ^{ a = 2; });

        NSLog(@"%d", a);
    }
Quiz 2
    int a;

    - (void)method {
        a = 0;
        dispatch_queue_t queue = dispatch_get_global_queue(
            DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_async(queue, ^{ a = 1; });
        while (a == 0) {
            // wait
        }

        NSLog(@"%d", a);
    }
Parallelism is a huge topic
Terminology
• Parallel
• Multi-threaded
• Concurrent
• Simultaneous
• Asynchronous
Why parallelize?
• Responsiveness
• “when I scroll, it’s smooth”
• Performance
• “it works fast”
• Energy saving
• “it doesn’t drain my battery”
• Convenience
• some things are parallel by nature, e.g. running two completely separate apps
How?
• Multiple processes
• XPC, fork
• Multiple threads
• POSIX Threads, NSThread
• High-level thread abstraction
• Operation queues, dispatch queues
• GPGPU
• Instruction-level parallelism
• superscalar CPUs, pipelining, vector instructions
• Multiple PCs
• servers, clouds
Threads
• What is a thread?
• It’s an abstraction made by the OS
• The CPU has no such concept
• Represents a line of calculation
• Has an ID, a stack, thread-local storage, priority, CPU registers
• Shares memory and resources within a process
• The OS scheduler runs/pauses threads
• context switching
Issues with threading
• Race conditions
• the result depends on the timing of the scheduler
• the behavior is non-deterministic
• can result in almost anything
• crash, wrong result, corrupted data
• So, you have to use locks/mutexes/…
• More issues: deadlocks, livelocks, starvation
• Even the best guys have trouble with these
• Security consequences, vulnerabilities
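To make the race-condition bullet concrete, here is a minimal sketch in plain C with POSIX threads. The counter, the thread count, and all function names are illustrative assumptions, not code from the talk. Four threads increment one shared counter; the mutex turns each read-modify-write into an exclusive step, so the final value is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define THREADS 4
#define ITERATIONS 100000

static long counter;  /* shared mutable state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter ITERATIONS times.
   Without the lock, two threads could read the same old value
   and one of the increments would be lost. */
static void *increment_many(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts THREADS workers, waits for all of them, returns the total. */
long run_locked_counter(void) {
    pthread_t threads[THREADS];
    counter = 0;
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, increment_many, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return counter;
}
```

Removing the lock/unlock pair would typically make the returned total vary from run to run, which is the non-determinism the slide warns about.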
Know your enemy
• The compiler
• The CPU
• The memory
• Time
• Your brain
The iPhone has matured
iPhone 4 512 MB RAM
A4 SoC (1 core) 800 MHz
iPhone 4S 512 MB RAM
A5 SoC (2 core) 800 MHz
iPhone 5 1 GB RAM
A6 SoC (2 core) 1300 MHz
iPhone 5S 1 GB RAM
A7 SoC (2 core) 1300 MHz
ARM has matured
• Apple A5 (2011)
• ARM Cortex-A9 MPCore
• 2 cores
• out-of-order execution
• speculative execution
• superscalar, pipelining (8 stages)
• NEON 128-bit SIMD
• Apple A7 (2013)
• ARMv8-A “Cyclone”
• 64-bit, 32 registers, per-core L1 cache
iOS has matured
• The kernel knows a lot more about the system than the developer
• GCD
• Operation Queues
• LLVM, compiler optimizations
• GPU computations
• Accelerate.framework
iOS threading technologies
• Multiple processes – forking disabled, no XPC
• Low-level threads
• POSIX Threads (pthread)
• NSThread
• -[NSObject performSelectorInBackground:withObject:]
• Higher-level abstractions
• NSOperationQueue, NSOperation
• GCD
Is multithreading hard?
• Yes, if you don’t know what you’re doing.
• But that’s true for anything.
• Paul E. McKenney: Is Parallel Programming Hard, And, If So, What Can You Do About It? (2013)
• https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
You need to know how it works
• The abstractions you use (threads, dispatch queues) are leaky
• You still must know how it works below:
• CPU
• OS
• compiler (LLVM)
• libraries and 3rd party code you are using
• language specification
• language implementation
• + the abstraction you are using (GCD)
You need to know even more
• Often you parallelize to get better performance
• For this you need to know
• CPU architecture details
• CPU instruction latencies
• memory hierarchy and latencies
Parallelizing tasks vs. algorithms
• Task = a standalone unit of work
• has some inputs, gives some outputs
• “add a blur effect to these 1000 photos”
• 1 photo = 1 task (independent)
• “add a blur effect to this one 5000x5000px photo”
• 1 task = ?
• Some algorithms simply cannot be parallelized (you will not get any significant speedup)
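One possible answer to "1 task = ?" for a single large image is to cut it into horizontal bands and treat each band as a task. The sketch below assumes an 8-bit grayscale buffer and uses a per-pixel inversion instead of a blur, because inversion has no dependency between pixels; the band count and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define BANDS 4

typedef struct {
    uint8_t *pixels;   /* start of the whole image */
    size_t width;      /* pixels per row */
    size_t first_row;  /* inclusive */
    size_t last_row;   /* exclusive */
} band_t;

/* Inverts every pixel in the rows owned by this band. */
static void *invert_band(void *arg) {
    band_t *band = arg;
    for (size_t y = band->first_row; y < band->last_row; y++) {
        for (size_t x = 0; x < band->width; x++) {
            uint8_t *p = &band->pixels[y * band->width + x];
            *p = (uint8_t)(255 - *p);
        }
    }
    return NULL;
}

/* Splits the image into BANDS row ranges and processes them in parallel.
   The ranges do not overlap, so no locking is needed. */
void invert_image_parallel(uint8_t *pixels, size_t width, size_t height) {
    pthread_t threads[BANDS];
    band_t bands[BANDS];
    for (size_t i = 0; i < BANDS; i++) {
        bands[i].pixels = pixels;
        bands[i].width = width;
        bands[i].first_row = height * i / BANDS;
        bands[i].last_row = height * (i + 1) / BANDS;
        int rc = pthread_create(&threads[i], NULL, invert_band, &bands[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < BANDS; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

A real blur reads neighbouring rows, so each band would have to read from an unmodified source buffer and write to a separate destination; the in-place trick only works here because every output pixel depends on exactly one input pixel.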
Part II: Thread-safety, GCD, operation queues
What’s thread safety?
• “Thread-safe object”
• you can safely use the object from multiple threads at the same time
• the internal state of the object will not get corrupted and it will behave correctly
• When you don’t know if an object is thread-safe, you have to assume it isn’t
• How do you make your object thread-safe?
• immutability, locks, atomic reads/writes
Shared mutable state
• Exclusive immutable object = no problem
• Shared immutable object = no problem
• Exclusive mutable object = no problem
• Shared mutable object
• root of all evil
• you always want to minimize this
Global variables
• “Global variables are bad”
• Multi-threading is another very good reason not to use global variables / global state
• Global variables are always shared
• Watch out for “hidden” global state:
• working directory, chdir()
• environment variables, putenv()
Thread-safety vs. iOS
• Terrible lack of proper documentation
• Most of the low-level Obj-C runtime is thread-safe
• memory management, ARC, weak references, …
• Immutable objects (NSString, NSArray, …) are thread-safe
• A few other classes are thread-safe
• Usually it’s thread-safe to call class methods
• google for “iOS thread safety”
• https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html
POSIX threads
• “plain threads”
• C API
• if you want to pass an object to the new thread, you will have issues with memory management
• Synchronization
• mutexes, conditions, R/W locks, barriers
POSIX thread API
• pthread_create
• pthread_join
• mutex
• pthread_mutex_init, pthread_mutex_lock, pthread_mutex_unlock
• conditions
• pthread_cond_init, pthread_cond_signal, pthread_cond_wait
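The calls listed above fit together roughly as follows. This is a minimal sketch, not code from the talk: the mailbox_t type, the value 42, and the function names are made up for illustration. One thread publishes a value under a mutex and signals a condition; the other waits for it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t ready_cond;
    bool ready;   /* protected by lock */
    int value;    /* protected by lock */
} mailbox_t;

/* Producer: stores a value and wakes up the waiting thread. */
static void *producer(void *arg) {
    mailbox_t *box = arg;
    pthread_mutex_lock(&box->lock);
    box->value = 42;
    box->ready = true;
    pthread_cond_signal(&box->ready_cond);
    pthread_mutex_unlock(&box->lock);
    return NULL;
}

/* Consumer: starts the producer and blocks until the value is ready. */
int receive_one_value(void) {
    mailbox_t box = { .ready = false, .value = 0 };
    pthread_mutex_init(&box.lock, NULL);
    pthread_cond_init(&box.ready_cond, NULL);

    pthread_t thread;
    int rc = pthread_create(&thread, NULL, producer, &box);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&box.lock);
    while (!box.ready) {
        /* The loop guards against spurious wakeups; the wait
           releases the mutex while blocked and reacquires it after. */
        pthread_cond_wait(&box.ready_cond, &box.lock);
    }
    int received = box.value;
    pthread_mutex_unlock(&box.lock);

    pthread_join(thread, NULL);
    pthread_cond_destroy(&box.ready_cond);
    pthread_mutex_destroy(&box.lock);
    return received;
}
```

The predicate (`ready`) is what carries the state; the condition variable only tells the waiter when to re-check it.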
NSThread
• “plain threads” as well
• Obj-C API
• mostly just a wrapper around POSIX threads
• memory management just works
• Synchronize with NSLock, NSCondition, …
NSThread API
• -[NSThread initWithTarget:selector:object:]
• -[NSThread start]
• +[NSThread detachNewThreadSelector:toTarget:withObject:]
• subclassing NSThread
• -[NSObject performSelectorInBackground:withObject:]
Thread-specific properties
• Thread-local storage
• Thread priorities
• Autorelease pools
• Detached vs. joinable
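As an illustration of thread-local storage with raw threads, the sketch below uses the C11 `_Thread_local` storage class (an assumption: the talk does not say which TLS mechanism it has in mind, and POSIX also offers pthread_key_create). Each thread sees its own copy of the variable; the names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One independent, zero-initialized copy per thread. */
static _Thread_local int scratch_counter;

/* Increments this thread's copy three times and reports it. */
static void *count_to_three(void *result) {
    for (int i = 0; i < 3; i++) {
        scratch_counter++;
    }
    *(int *)result = scratch_counter;
    return NULL;
}

/* Runs two worker threads and returns the calling thread's own copy,
   which the workers cannot touch. */
int thread_local_demo(int *first_result, int *second_result) {
    pthread_t first, second;
    scratch_counter = 10;  /* affects only the calling thread */

    int rc = pthread_create(&first, NULL, count_to_three, first_result);
    assert(rc == 0);
    rc = pthread_create(&second, NULL, count_to_three, second_result);
    assert(rc == 0);
    (void)rc;

    pthread_join(first, NULL);
    pthread_join(second, NULL);
    return scratch_counter;
}
```

Both workers report 3 and the caller still sees 10, because no copy is shared.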
Grand Central Dispatch
• Let’s not think about threads
• Instead, let’s think about tasks
• New concepts:
• Tasks
• Queues
• Queue-specific data
• Dispatch groups
• Dispatch sources
• Synchronization
• Semaphores, barriers
• C API (!) but has ARC and works with blocks
GCD queues
• Main queue
• there is just one, executed on the main thread
• Concurrent queue
• tasks run concurrently
• 4 pre-made concurrent queues with different priorities
• DISPATCH_QUEUE_PRIORITY_DEFAULT, _HIGH, _LOW, _BACKGROUND
• you can make your own
• Serial queue
• only one task at a time, in order
• you can make your own
GCD task API
• Get/create a queue:
• dispatch_get_global_queue
• dispatch_get_main_queue
• dispatch_queue_create
• Submit task:
• dispatch_sync
• dispatch_async
• dispatch_apply
GCD convenience API
• dispatch_once
• guarantees the code runs only once
• use to implement a proper and fast singleton
• dispatch_after
• submit the task to a queue at a specific time (it runs at or after that time, not exactly at it)
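dispatch_once is an Apple API, so as a portable stand-in the sketch below uses POSIX pthread_once, which gives the same "run the initializer exactly once, even under contention" guarantee. The settings_t type and all function names are hypothetical and only illustrate the singleton pattern mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CALLERS 8

typedef struct { int configured; } settings_t;  /* hypothetical singleton type */

static settings_t shared_settings;
static int init_calls;  /* how many times the initializer ran */
static pthread_once_t settings_once = PTHREAD_ONCE_INIT;

static void init_settings(void) {
    shared_settings.configured = 1;
    init_calls++;
}

/* Singleton accessor: the first caller initializes, all others wait
   until initialization has finished and then reuse the result. */
settings_t *get_shared_settings(void) {
    pthread_once(&settings_once, init_settings);
    return &shared_settings;
}

static void *call_getter(void *unused) {
    (void)unused;
    return get_shared_settings();
}

/* Calls the accessor from CALLERS threads at once and returns how
   often the initializer actually ran. */
int count_initializations(void) {
    pthread_t threads[CALLERS];
    for (int i = 0; i < CALLERS; i++) {
        int rc = pthread_create(&threads[i], NULL, call_getter, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < CALLERS; i++) {
        void *seen = NULL;
        pthread_join(threads[i], &seen);
        assert(seen == &shared_settings);
    }
    return init_calls;
}
```

In Objective-C the same shape is usually written with a static dispatch_once_t token and a block passed to dispatch_once.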
It’s not threads
• GCD uses threads, but the threads are completely managed by GCD
• You can’t assume your code will run on any specific thread
• even two tasks from the same serial queue can run on different threads
• Don’t use thread-local storage
• Don’t use thread priorities
Operation queues
• A similar abstraction to GCD, this time you have:
• NSOperation
• either a block, a method call or custom subclass
• concurrent or non-concurrent
• dependencies on other NSOperations
• support for cancellation
• NSOperationQueue
• executes the operations, or you can execute an operation directly
Operation queues API
• -[NSOperationQueue addOperation:]
• -[NSOperationQueue addOperationWithBlock:]
• -[NSOperation addDependency:]
• +[NSBlockOperation blockOperationWithBlock:]
• -[NSInvocationOperation initWithTarget:selector:object:]
Comparison
• POSIX threads, NSThread
• thread-based
• you have control over the lifetime of threads
• overhead when creating
• memory-management issues
• GCD, operation queues
• task-based
• nice API with objects/blocks
• operation queues
• dependencies
Run loops and messaging
• Avoid shared mutable state
• For POSIX threads and NSThreads:
• put your thread into an event loop, where it just waits until an event occurs
• the main thread has this by default
• hidden inside UIApplicationMain
• then you can communicate with the thread through:
• -[NSObject performSelector:onThread:withObject:waitUntilDone:]
Run loop API
• +[NSRunLoop currentRunLoop]
• -[NSRunLoop run]
• you have to add at least one input source or it will return immediately
• but you can add an empty port
• [NSMachPort port]
• -[NSRunLoop addPort:forMode:]
Main thread
• first thread = main thread = UI thread
• all rendering
• all layout
• scrolling, panning, zooming
• user input (touches, on-screen keyboard, external keyboard)
• system events
• Yes, that’s a lot of work.
• 60 FPS = 16 ms per frame
• Yes, that’s very little time.
Offload the main thread
• Goal: Keep the UI thread responsive
• Rule:
• Do as much work as possible on other threads
• Well, but…
• Do as little work as possible in the background, that is just enough to keep the main thread responsive
• Measure, measure, measure
Rendering and animations
• Your app doesn’t have access to the GPU/display
• Background process called “backboardd”
• IPC – rendering commands
• Shared memory – backing stores
• CAAnimations are transferred to backboardd and performed without any communication with your app
Demo 1 https://github.com/kubabrecka/mobos-ios
Part III: Synchronization, locking, memory model
Demo 2 https://github.com/kubabrecka/mobos-ios
Only trust what’s guaranteed
• The order of things isn’t guaranteed unless someone tells you:
    int a, b; // global variables

    // thread 1
    b = 20;
    a = 10;

    // thread 2
    // wait for a to be 10
    NSLog(@"%d", b); // ?
Solutions
• Avoid shared mutable state
• communicate by message passing
• design your objects as immutable
• avoid multithreading
• Synchronization
• You must always have “a plan”
• if you can’t tell which code is supposed to run in which thread, then nobody can help you
• if you can’t tell which data can be accessed from which thread, then nobody can help you
So what is guaranteed?
• Semantics for one thread
• “the (single-threaded) code you wrote will have the correct result”
• For multi-threaded code, you have to obtain guarantees by using:
• Atomic data types, volatile keyword
• Locks, semaphores, memory barriers
• For 3rd party code, generally you can’t assume anything
Atomic types
• Which data types are atomic?
• Depends on the architecture!
• Pointers and “native” integers are usually atomic
• What does an atomic data type guarantee?
• Also depends on the architecture!
• A single read or a single write is usually atomic
• Definitely not “i++”
• OSAtomicIncrement, …
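To show why "i++" needs an atomic read-modify-write, here is a minimal sketch using C11 `<stdatomic.h>` as a portable stand-in for the OSAtomicIncrement family (an assumption: the talk names only the Apple calls). The counter name and thread count are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define THREADS 4
#define ITERATIONS 100000

static atomic_long hits;

/* atomic_fetch_add performs the load, add, and store as one
   indivisible step, so no increment can be lost. */
static void *record_hits(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(&hits, 1);
    }
    return NULL;
}

/* Runs THREADS workers without any lock and returns the total. */
long run_atomic_counter(void) {
    pthread_t threads[THREADS];
    atomic_store(&hits, 0);
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, record_hits, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&hits);
}
```

With a plain `long` and `hits++`, each increment is a separate read and write, and concurrent threads can overwrite each other's updates.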
Objective-C atomic properties
• @property (atomic) int a;
• Only affects auto-generated getters and setters
• Again, a single read is atomic, a single write is atomic
• Again, “obj.a++” is not atomic
• It has no effect on direct member access, obj->a
• “atomic” is default
Objective-C messaging
• Is the order of Obj-C method calls guaranteed?
• It seems so: current compilers don’t optimize across the dynamic dispatch (objc_msgSend)
• But it’s still not guaranteed
• This might (and probably will) change in the future
Volatile keyword
• don’t confuse with Java volatile
• prevents some compiler optimizations
• the variable can change on its own
• doesn’t give you atomicity
• doesn’t give you ordering
• there are better means of synchronization
Locks
• Mutexes, critical sections
• allow only a single thread to be in this part of code at the same time
• -[NSLock lock]
• -[NSLock unlock]
• @synchronized (obj) { … }
• uses an implicit lock, which exists on each object
• handles exceptions
• Recursive locks, R/W locks, conditions
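For the R/W locks mentioned in the last bullet, a minimal sketch with POSIX pthread_rwlock follows. The two-field invariant, the counts, and the function names are assumptions chosen for illustration: many readers may hold the lock together, while a writer gets exclusive access, so readers never observe a half-finished update.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes pthread_rwlock_* under strict ISO C modes */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define READERS 3
#define UPDATES 20000

static pthread_rwlock_t pair_lock = PTHREAD_RWLOCK_INITIALIZER;
static long left_value, right_value;  /* invariant: always equal outside the write lock */

/* Writer: updates both fields under the exclusive (write) lock. */
static void *writer(void *unused) {
    (void)unused;
    for (int i = 0; i < UPDATES; i++) {
        pthread_rwlock_wrlock(&pair_lock);
        left_value++;
        right_value++;
        pthread_rwlock_unlock(&pair_lock);
    }
    return NULL;
}

/* Reader: checks the invariant under the shared (read) lock. */
static void *reader(void *arg) {
    long *violations = arg;
    for (int i = 0; i < UPDATES; i++) {
        pthread_rwlock_rdlock(&pair_lock);
        if (left_value != right_value) {
            (*violations)++;
        }
        pthread_rwlock_unlock(&pair_lock);
    }
    return NULL;
}

/* Returns how many times any reader saw the two fields disagree. */
long count_invariant_violations(void) {
    pthread_t writer_thread, reader_threads[READERS];
    long violations[READERS] = {0};
    long total = 0;
    left_value = right_value = 0;

    int rc = pthread_create(&writer_thread, NULL, writer, NULL);
    assert(rc == 0);
    for (int i = 0; i < READERS; i++) {
        rc = pthread_create(&reader_threads[i], NULL, reader, &violations[i]);
        assert(rc == 0);
    }
    (void)rc;

    pthread_join(writer_thread, NULL);
    for (int i = 0; i < READERS; i++) {
        pthread_join(reader_threads[i], NULL);
        total += violations[i];
    }
    return total;
}
```

A plain mutex would be just as correct here; the read/write split only pays off when reads greatly outnumber writes.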
Lock-free algorithms and data structures
• Some concurrent structures (hash tables, queues) can be written without using explicit locks
• Currently a major topic in CS
• databases
• The name is confusing, though: a lot of locking still happens at the hardware level
• cache coherency
• memory bus locking for complex atomic operations
Memory barriers
• Locks can be expensive
• A memory barrier ensures ordering without locking
• Memory reads and writes issued before the barrier complete before those issued after it
• But the guarantee is only at the point of the barrier!
• OSMemoryBarrier
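The a/b example from the "Only trust what’s guaranteed" slide can be made well-defined with barriers. The sketch below is an assumption-laden translation into portable C11: atomic_thread_fence stands in for OSMemoryBarrier (which is a full barrier, stronger than what is used here), `a` becomes an atomic flag, and the function names are made up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int b;         /* plain data, published through the flag */
static atomic_int a;  /* flag written by thread 1, polled by thread 2 */

/* Thread 1: write the data, then the barrier, then the flag. */
static void *thread_1(void *unused) {
    (void)unused;
    b = 20;
    atomic_thread_fence(memory_order_release);  /* the write to b cannot move below */
    atomic_store_explicit(&a, 10, memory_order_relaxed);
    return NULL;
}

/* Thread 2: wait for the flag, then the barrier, then read the data. */
int thread_2_reads_b(void) {
    pthread_t writer;
    b = 0;
    atomic_store(&a, 0);

    int rc = pthread_create(&writer, NULL, thread_1, NULL);
    assert(rc == 0);
    (void)rc;

    while (atomic_load_explicit(&a, memory_order_relaxed) != 10) {
        /* spin until the flag is set */
    }
    atomic_thread_fence(memory_order_acquire);  /* the read of b cannot move above */
    int observed = b;

    pthread_join(writer, NULL);
    return observed;
}
```

Both barriers are needed: one in the writer and one in the reader. A barrier on only one side leaves the other thread free to reorder its accesses.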
Is the trouble worth it?
• Measure!
• OK, so you need more than a single thread
• use task-level parallelization (GCD) with clear input and output, use immutable data and message passing
• Measure again!
• OK, so you need more than that
• find the bottleneck, don’t assume
• is it really the CPU? Isn’t the bottleneck in the memory/network/disk?
Demo 3 https://github.com/kubabrecka/mobos-ios
Part IV: Performance tuning, ILP
Multithreading isn’t everything
• There are plenty of ways to make your code run faster
• avoiding unnecessary work
• choosing better algorithms
• calculations on the GPU
• using vector instructions (AVX, SSE, NEON)
• hand-optimizing your assembly
• tweaking the compiler optimizations
The bottleneck
• It’s easy to make wrong assumptions
• Your bottleneck can be
• CPU
• Memory
• I/O (disk, network)
• GPU
• There is no “usually”
Some common UI issues
• Creating UIViews is slow
• reuse views, dequeue cells in tables
• Loading images is slow
• cache images
• Rendering is slow
• avoid drawRect, consider rasterization of flattened views
• Scrolling is slow
• don’t do heavy work in scrollViewDidScroll
• Rendering shadows is slow
• use shadowPath
• Rendering layer masks is slow
• pre-render
Choose your data structures
• -[NSArray containsObject:]
• O(n)
• -[NSSet containsObject:]
• O(1) on average
Always profile first
• Don’t guess, measure!
• Amdahl’s law
• Hardware is cheap, programmers are expensive
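Amdahl’s law says that if a fraction p of the running time can be parallelized over n cores, the best possible speedup is 1 / ((1 − p) + p / n). A small helper makes the numbers easy to check before investing in parallelization; the function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on speedup according to Amdahl's law.
   parallel_fraction: share of the original running time that can be
   parallelized, between 0.0 and 1.0.
   cores: number of cores the parallel part is spread over. */
double amdahl_speedup(double parallel_fraction, unsigned cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores > 0);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For code that is 90% parallelizable, two cores give at most about a 1.82x speedup, and no number of cores can exceed 10x, because the serial 10% remains.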
Profiling with Instruments
• What can you measure with Instruments?
• CPU
• utilization
• all performance counters (interrupts, syscalls, user/kernel time, …)
• Memory
• free memory
• allocations, leaks, “zombies”
• many more performance counters (page faults, cache hits/misses, …)
• Network
• Battery usage
• Display FPS
• Single process / multiple processes
• …
Measure carefully
• Instruments isn’t perfect
• Sampling is only a statistical method
• Real devices behave very differently from the simulator
• Hardware is different
• Compiled code is different (both yours and libraries)
• Verify your assumptions
• In many cases, wrapping your code with two calls to [NSDate date] and subtracting is the best approach
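The two-timestamps approach looks roughly like this in plain C. This is a sketch under assumptions: it uses the POSIX monotonic clock rather than [NSDate date], since a monotonic clock cannot jump when the wall-clock time is adjusted, and the workload and function names are invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict ISO C modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Current monotonic time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs work(context) once and returns the elapsed time in seconds. */
double time_call(void (*work)(void *), void *context) {
    double start = now_seconds();
    work(context);
    return now_seconds() - start;
}

/* Example workload: sums 1..1000000 and stores the result in *out,
   which must point to an unsigned long long. */
void sum_first_million(void *out) {
    unsigned long long total = 0;
    for (unsigned long long i = 1; i <= 1000000ULL; i++) {
        total += i;
    }
    *(unsigned long long *)out = total;
}
```

A single measurement is noisy; repeating the call several times and looking at the minimum or median usually gives a more trustworthy number.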
Optimize memory/cache accesses
• Cache lines (64 B)
• Try to linearize memory accesses
• Choose correct data structures
• array of structs vs. struct of arrays
• Aligned memory accesses
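The "array of structs vs. struct of arrays" choice can be sketched as follows. The particle fields, the count, and the function names are hypothetical; both functions compute the same sum, but they touch memory very differently.

```c
#include <assert.h>
#include <stddef.h>

#define PARTICLES 1024

/* Array of structs: all fields of one particle sit next to each other. */
typedef struct { float x, y, z, mass; } particle_t;

/* Struct of arrays: all values of one field sit next to each other. */
typedef struct {
    float x[PARTICLES];
    float y[PARTICLES];
    float z[PARTICLES];
    float mass[PARTICLES];
} particles_soa_t;

/* Reads one mass out of every struct; with 4-byte floats that is
   4 useful bytes out of every 16 loaded into the cache. */
float total_mass_aos(const particle_t *particles, size_t count) {
    assert(particles != NULL);
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += particles[i].mass;
    }
    return total;
}

/* Reads a contiguous run of floats; every byte of each cache
   line that is loaded gets used. */
float total_mass_soa(const particles_soa_t *particles, size_t count) {
    assert(particles != NULL && count <= PARTICLES);
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += particles->mass[i];
    }
    return total;
}
```

If the hot loop uses all fields of each particle together, the array-of-structs layout can be the better one; the right choice depends on the access pattern, which is why measuring matters.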
Instruction-level parallelism
• The compiler tries to maximize ILP with scheduling
• The main obstacle is data dependency
• a series of arithmetic operations which depend on each other simply cannot be parallelized
• independent operations are easily parallelized
• CPU is superscalar and has deep pipelines
• the problem is that often the compiler can’t be sure about the dependency
• memory accesses, aliasing
• it has to assume the dependency is there
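The data-dependency point can be illustrated with a sum. In the sketch below (function names are made up), the first version forms one long chain where every addition waits for the previous one; the second keeps four independent chains that a superscalar CPU can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each addition needs the previous total. */
float sum_single_chain(const float *values, size_t count) {
    assert(values != NULL || count == 0);
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}

/* Four independent chains: the additions inside one loop iteration
   do not depend on each other, so they can execute in parallel. */
float sum_four_chains(const float *values, size_t count) {
    assert(values != NULL || count == 0);
    float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        t0 += values[i];
        t1 += values[i + 1];
        t2 += values[i + 2];
        t3 += values[i + 3];
    }
    for (; i < count; i++) {  /* leftover elements */
        t0 += values[i];
    }
    return (t0 + t1) + (t2 + t3);
}
```

Floating-point addition is not associative, so the two versions can round differently on general input. That is the reason a compiler will not make this transformation on its own unless it is explicitly allowed to relax floating-point semantics.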
Help the compiler
• The compiler is smart:
• GCC: dead code elimination, common subexpression elimination, forward propagation, loop unrolling, tail call elimination, loop invariant motion, lower complex arithmetic, vectorization, modulo scheduling, …
• Sometimes, it would like to be smart, but it can’t:
• the C “restrict” keyword (C99):
void * memcpy(void * restrict s1, const void * restrict s2, size_t n);
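The same keyword can be used in your own functions. In this minimal sketch (the function name and parameters are hypothetical), `restrict` promises the compiler that `out` and `in` never overlap, so it does not have to assume that a store to `out[i]` might change a later `in[j]`, and it is free to reorder or vectorize the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Computes out[i] += factor * in[i] for count elements.
   The caller must guarantee that the two arrays do not overlap;
   passing overlapping arrays is undefined behavior. */
void scale_and_add(float *restrict out, const float *restrict in,
                   float factor, size_t count) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < count; i++) {
        out[i] += factor * in[i];
    }
}
```

The promise is not checked by the compiler, which is why memmove exists next to memcpy for the overlapping case.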
Vector instructions
• SIMD = Single Instruction Multiple Data
• ARM NEON
• 128-bit instructions (e.g. 4x 32-bit or 16x 8-bit at once)
• LLVM auto-vectorizer
• Often you have to change your data structure
• alignment
• interleaved values
Accelerate.framework
• Heavily optimized built-in framework for:
• image processing
• image format conversion and encoding/decoding
• DSP, FFT
• various general math on “large” data
    #include <Accelerate/Accelerate.h>

    vFloat vx = { 1.f, 2.f, 3.f, 4.f };
    vFloat vy;
    ...
    vy = vsinf(vx);
Away from the CPU
• GPGPU
• Only through OpenGL ES shaders
• Perfect for image processing (Core Image, GPUImage)
• M7 motion coprocessor (iPhone 5S)
Thank you for your attention.