Introduction to Memory Optimization

Posted on 13-Jun-2015


DESCRIPTION

A brief introduction to memory optimization techniques in C/C++ for Cal Poly Pomona students.

Transcript

An Introduction to Memory Optimization Techniques

Koray Hagen

My background

1. Software engineer in the games industry
   1. League of Legends
   2. Hearthstone
   3. PlayStation 4
   4. Xbox 360

2. Worked on many optimization problems for:

   1. Game scalability
   2. Content pipelines
   3. Client run-time
   4. Server run-time
   5. Data formats

3. One rule. Performance is king.

Prerequisite Knowledge

1. Exposure to C or C++

2. Exposure to computer architecture

Thank you to SCEA Santa Monica Studio

1. Christer Ericson
   1. VP, Central Technology, Activision
   2. Previously, Director of Technology, SCEA
   3. Author of “Real-time Collision Detection”
   4. Author of the original optimization presentation
   5. Authority on optimization and game physics

The agenda

1. “The Black Box”

2. Memory hierarchies and cache

3. Optimization techniques for instructions and data

4. Aliasing and restriction

5. Closing thoughts and further reading

What won’t be covered

1. Data-Oriented Design
   1. Modern object-oriented programming is polluting programmers’ minds
   2. A refocus on creating better representations and computation around data rather than abstractions

2. SIMD or other approaches to vectorized code generation and usage
   1. Instruction-level parallelism
   2. A deep dive into the losing war between processor speed and memory speed

The Black Box
Challenges in the modern era of computing

The downward spiral of performance

1. There is an accelerating gap between CPU and memory performance
   1. CPU speed increases annually by ~60%
   2. Memory speed increases annually by ~10%

2. The gap has been bridged by the use of cache memory
   1. Recent renewed interest from the C++ community
   2. Unfortunately, cache is still vastly underexploited
   3. Diminishing returns for large caches (physical locality)

3. Advances in instruction parallelism are overshadowing data performance
   1. Data consumption at run-time is astronomically high

4. Inefficient cache use means lower performance
   1. The obvious question: how do I increase cache utilization?
   2. Answer: cache-aware programming/programmers (you, after today’s slides)

Memory hierarchies and cache
A look at current architectures

An overview of cache

1. Memory hierarchy
   1. Discrete instruction cache
   2. Discrete data cache

2. Cache lines
   1. Cache is physically divided into cache lines of N bytes (typically 32 or 64 bytes) each
   2. The discrete unit for counting memory accesses

3. Example architecture – direct mapping
   1. For an n-byte cache, the bytes at addresses k, k+n, k+2n, … all map to the same cache line

4. Example architecture – N-way associative
   1. Each logical cache line corresponds to N physical lines
   2. Minimizes cache thrashing

Theoretical memory hierarchy

CPU registers: 1 cycle
L1 cache: ~2–5 cycles
L2 cache: ~5–20 cycles
Main memory: ~40–100 cycles

Example cache specifications

1. Emergence of L3 cache on high-end processors
2. Nothing magical about the speed: strict physical-locality requirements relative to the co-processors and main memory

Platform        L1 cache (I & D)   L2 cache
PlayStation 4   256 KB             4 MB
Wii U           64 KB              3 MB
Xbox One        256 KB             4 MB
PC              512 KB             6 MB

Beware the three C’s of cache misses

1. Compulsory misses
   1. Unavoidable misses when reading data for the first time

2. Capacity misses
   1. Not enough cache space to hold all active data
   2. Too much data accessed in between successive uses

3. Conflict misses
   1. Two blocks of memory map to the same location and there is not enough room to hold both, ultimately causing thrashing

Introduce the three R’s into your program

1. Rearrange (code and data)
   1. Change layouts to increase spatial locality

2. Reduce (size and number of cache lines read)
   1. Create smaller and smarter formats
   2. Compression

3. Reuse (cache lines)
   1. Increase temporal and spatial locality

Instruction and data cache optimization
Strategies for performance and cache-awareness

Instruction optimization strategy

1. Locality
   1. Reorder functions
      1. Manually within the file
      2. Reorder object files during the link stage
      3. Visual Studio intrinsic: #pragma section("section-name" [attributes])

2. Adapt coding style
   1. Balance between monolithic functions and separation of logic
   2. Encapsulation and OOP are usually less cache friendly – though not always

3. Implicit code generation
   1. Example: casting (cvttss2si)
   2. Study the code that your compiler generates
   3. Build intuition for how the compiler optimizes

Instruction optimization strategy … continued

1. Size
   1. Beware inlining, unrolling, and large macros!
      1. Always understand the cost-value tradeoffs of programming decisions
      2. Avoid unnecessary features and code paths
      3. Loop splitting and loop compounding

2. Again, always study the generated code.

Data optimization strategy

1. Compress data
   1. Does not necessarily mean compression algorithms
   2. Can you store more in less?

2. Cache-conscious data layouts
   1. Padding to align to cache lines
   2. Reordering to align to cache lines
   3. Ordering variables by personal preference has no value

3. Linearize data
   1. Array-based data structures

Structure field data reordering

Data that is likely to be accessed together should be stored together

Be aware of compiler padding

1. What are the values of size_one and size_two?

2. Hint: not the same, so how is member data aligned?

3. Ordering member data by personal preference rather than by alignment is a bad programmer habit.
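The code behind the size_one / size_two question is not in the transcript, but the comparison can be sketched; the struct and field names below are illustrative reconstructions, not the originals.

```cpp
#include <cassert>
#include <cstddef>

// Members ordered carelessly: the compiler inserts padding so each
// double lands on an 8-byte boundary (sizes assume a typical 64-bit ABI).
struct Unordered {
    char   a;   // 1 byte, then 7 bytes of padding before b
    double b;   // 8 bytes
    char   c;   // 1 byte, then 7 bytes of tail padding
};              // typically 24 bytes

// Same members, widest first: padding shrinks.
struct Ordered {
    double b;   // 8 bytes
    char   a;   // 1 byte
    char   c;   // 1 byte, then 6 bytes of tail padding
};              // typically 16 bytes

constexpr std::size_t size_one = sizeof(Unordered);
constexpr std::size_t size_two = sizeof(Ordered);
```

Same fields, different sizes: on a typical 64-bit platform the reordered struct is a third smaller, which directly reduces the number of cache lines a hot array of them occupies.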

“Hot and cold” data division

1. Achieve much better cache behavior by striving for temporal locality among data members.

2. How often is your data in cache? Data access must scale towards the most common case, not the worst case.

“Hot and cold” data division … continued
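The slide's code is lost, so here is a minimal sketch of the idea with invented names: the fields the per-frame loop touches stay inline, while rarely used "cold" fields move behind a single pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Everything inline: iterating positions drags the name/debug bytes
// through the cache on every frame.
struct EntityAllInOne {
    float x, y, z, velocity;
    std::string name;
    std::string debug_info;
};

// Cold data: touched only on spawn or in a debug UI.
struct EntityCold {
    std::string name;
    std::string debug_info;
};

// Hot data plus one cold pointer: far more entities fit per cache line.
struct Entity {
    float x, y, z, velocity;
    std::unique_ptr<EntityCold> cold;
};
```

The hot struct shrinks to the four floats plus one pointer, so the common case (the per-frame update) reads far fewer cache lines; the cold path pays one extra indirection, which is the right trade when it runs rarely.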

Linearization of data

1. Nothing beats linear data
   1. Best overall spatial locality – values sit right next to each other
   2. Easy to prefetch, resulting in a better cache-line hit probability

2. What if my data can’t easily be represented linearly?
   1. Linearize at run-time – there is no excuse
   2. Fetch and store into a custom cache
   3. Great candidates for linearization:
      1. Hierarchy traversal
      2. Indexed data
      3. Random-access data
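As one sketch of linearizing a hierarchy (not from the slides): a complete binary tree can live in a flat array using the classic implicit-heap layout, so traversal and bulk passes walk contiguous memory instead of chasing node pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Complete binary tree in breadth-first order: the children of the
// node at index i sit at 2i+1 and 2i+2, so no child pointers are stored.
struct FlatTree {
    std::vector<int> nodes;
    static std::size_t left(std::size_t i)  { return 2 * i + 1; }
    static std::size_t right(std::size_t i) { return 2 * i + 2; }
};

// A whole-tree pass is just a linear sweep -- trivially prefetchable.
int sum(const FlatTree& t) {
    int total = 0;
    for (int v : t.nodes) total += v;
    return total;
}
```

A pointer-based tree would scatter nodes across the heap; here every node the sweep touches shares a cache line with its neighbors.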

Matrix multiplication example

1. The result of bad programmer habits, and of programming towards an assumed general case
   1. How can it be better?
   2. What options do we have?
   3. How can we save ourselves?
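The code from this slide is not in the transcript; a plausible reconstruction of the naive version is a triple loop over 2×2 matrices. Because result may alias lhs or rhs, the compiler must assume every store to result can invalidate previously loaded operand values.

```cpp
#include <cassert>

using Mat2 = float[2][2];

// Naive multiply: result is written while lhs/rhs are still being read,
// so the compiler cannot safely keep lhs/rhs values in registers.
void multiply(Mat2 result, const Mat2 lhs, const Mat2 rhs) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            result[i][j] = 0.0f;
            for (int k = 0; k < 2; ++k)
                result[i][j] += lhs[i][k] * rhs[k][j];
        }
}
```

Note that calling multiply(m, m, m) overwrites result[0][0] before later iterations read it, producing a wrong product; that legal-but-dangerous call is exactly why the compiler must refetch conservatively.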

Matrix multiplication example … continued

1. But wait! There is a hidden assumption that result, lhs, and rhs are all distinct
2. The compiler does not and cannot know this – more on this later
   1. Line 3: lhs[0][0] and lhs[0][1] must be re-fetched
   2. Line 4: rhs[0][0] and rhs[1][0] must be re-fetched
   3. Line 5: lhs[1][0], lhs[1][1], rhs[0][1] and rhs[1][1] must be re-fetched

3. We can do even better
   1. Let’s try unrolling the multiplication

Matrix multiplication example … continued

1. Cache all inputs; leave no room for unneeded indirection
2. Write to each needed memory location once
3. Result:
   1. No branches
   2. No conditionals
   3. No aliasing
   4. No side effects
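A sketch of the fully unrolled version the slide describes (the original code is lost): read every input into locals first, then write each output exactly once, so no store to result can invalidate a pending read.

```cpp
#include <cassert>

using Mat2 = float[2][2];

void multiply_unrolled(Mat2 result, const Mat2 lhs, const Mat2 rhs) {
    // Cache all inputs in locals/registers before any store to result.
    const float a = lhs[0][0], b = lhs[0][1];
    const float c = lhs[1][0], d = lhs[1][1];
    const float e = rhs[0][0], f = rhs[0][1];
    const float g = rhs[1][0], h = rhs[1][1];

    // Each destination written exactly once: no branches, no refetches,
    // and even multiply_unrolled(m, m, m) is now correct.
    result[0][0] = a * e + b * g;
    result[0][1] = a * f + b * h;
    result[1][0] = c * e + d * g;
    result[1][1] = c * f + d * h;
}
```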

Real Example

Aliasing and restricted pointers
Run-time costs the compiler will never tell you about

What is aliasing?

Aliasing is multiple references to the same storage location

What value is returned: 1 or 2? Nobody knows.
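The snippet behind this puzzle is not in the transcript; a minimal reconstruction looks like this. If the two pointers refer to the same int, the function returns 2; if not, it returns 1 -- so the compiler must reload *a after the store instead of assuming it still holds 1.

```cpp
#include <cassert>

// Returns 1 if a and b point to different objects, 2 if they alias.
int puzzle(int* a, int* b) {
    *a = 1;
    *b = 2;
    return *a;  // forced reload: the store through b may have changed *a
}
```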

Penalties for introducing abstractions

1. Higher levels of abstraction have a negative effect on optimization
   1. Object-oriented code naturally inclines programmers toward cache obliviousness
   2. “Information hiding” is a key principle, potentially hiding insights into achieving optimal performance

2. Inevitably, lots of temporary objects

3. Objects live on the heap and stack
   1. Subject to aliasing problems
   2. Constant indirection to access and transform any meaningful data

4. Implicit aliasing through the this pointer
   1. Member variables are just as bad as globals

Penalties for introducing abstractions … continued

1. m_count is a member, not a local variable, so it is reached through the implicit this pointer
2. m_count may be aliased by m_ptr
3. Every iteration is therefore likely to re-fetch m_count from main memory

Penalties for introducing abstractions … continued

Are you sure the compiler does this optimization for you? Don’t leave it up to chance.
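The class from these slides is not in the transcript; the following hypothetical reconstruction shows both the member-aliasing trap and the manual fix of hoisting members into locals.

```cpp
#include <cassert>
#include <cstddef>

class Buffer {
    int*        m_ptr   = nullptr;
    std::size_t m_count = 0;
public:
    Buffer(int* p, std::size_t n) : m_ptr(p), m_count(n) {}

    // Writes through m_ptr may (as far as the compiler knows) modify
    // m_count, so m_count is conservatively re-fetched each iteration.
    void clear_slow() {
        for (std::size_t i = 0; i < m_count; ++i)
            m_ptr[i] = 0;
    }

    // Hoist the members into locals: no aliasing doubt remains, and the
    // loop bound and base pointer stay in registers.
    void clear_fast() {
        const std::size_t count = m_count;
        int* const ptr = m_ptr;
        for (std::size_t i = 0; i < count; ++i)
            ptr[i] = 0;
    }
};
```

Both functions produce the same result; the explicit version simply removes the question the compiler cannot answer for itself.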

Restricted pointers

1. The restrict keyword
   1. Supported by many C++ compilers (MSVC, GCC)
   2. Controversial within the standards committee

2. Restrict is a promise
   1. It tells the compiler that, for the scope of the pointer, the pointed-to location will be accessed through that pointer alone. It is a promise not to alias.

3. Important in C++
   1. Helps combat abstraction-penalty problems
   2. Tricky semantics, easy to get wrong
   3. The compiler will never inform you about incorrect usage
   4. Incorrect usage results in agonizing pain

Restricted pointers … continued

What you really want is the compiler to generate this

… But because of aliasing, the compiler cannot do it

Restricted pointers … continued

The fix? Restrict the pointers

1. Prefer an explicit coding style; leave nothing to chance
2. Be careful and pragmatic; understand which code paths can be taken through your functions
3. Remember, a restrict-qualified pointer can grant access to a non-restrict pointer
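The restricted code from the slide is lost; as a minimal sketch, here is the common array-addition example using the `__restrict` spelling accepted by MSVC, GCC, and Clang (standard C++ has no restrict keyword; C99 spells it `restrict`).

```cpp
#include <cassert>

// The qualifier promises the three arrays never overlap, so the
// compiler may keep loads in registers and vectorize without refetching.
void add_arrays(float* __restrict out,
                const float* __restrict a,
                const float* __restrict b,
                int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];  // out never aliases a or b: no reload
}
```

If a caller ever passes overlapping arrays here, the promise is broken and the behavior is undefined -- the compiler will not warn, which is exactly the "agonizing pain" the slide warns about.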

Restricted pointers … continued

Remember, despite intuition “const” doesn’t help

1. “Wait, since *rhs is const, lhs[i] cannot write to it, right?” … WRONG
2. const promises that *rhs is const through rhs, NOT that *rhs is const in general
3. const is for detecting programming errors, not for fixing aliasing
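The point can be demonstrated in a few lines (an illustrative sketch, not from the slides): const gives a read-only view through one name, not a no-alias guarantee.

```cpp
#include <cassert>

// rhs is a pointer-to-const, yet a write through lhs can still change
// *rhs when the two pointers refer to the same object.
int demo(int* lhs, const int* rhs) {
    int before = *rhs;
    *lhs = 42;               // may write the very storage rhs points at
    return *rhs - before;    // nonzero exactly when lhs and rhs alias
}
```

Because *rhs can change behind the const view, the compiler still cannot cache it across the store; only a no-alias promise (restrict) removes that reload.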

Tips for avoiding aliasing

1. Minimize use of global pointers and references
   1. Recall the semantics of the Matrix example
   2. Pass small variables by value
   3. Use local variables as much as possible

2. Restrict pointers and references when appropriate

3. Declare variables close to the point of use, and no further

4. Aim to write “pure” functions, and strive for const-correctness

5. Study generated code!

Optimization isn’t magic

1. Strive for explicitness in programming
   1. Leave no room for unintended side effects

2. Understand the hardware architecture being targeted
   1. Constant factors in programming matter
   2. Relevant for all platforms:
      1. Game consoles
      2. Mobile
      3. Servers
      4. … Even normal desktops/laptops

3. It’s not over – many more topics to explore
   1. Branch prediction
   2. SIMD and vectorized code
   3. Cache-aware data structures

Further reading and references

1. Abrash, Michael. Zen of Code Optimization. Scottsdale, AZ: Coriolis Group, 1994. Print.

2. Ericson, Christer. Real-time Collision Detection. Amsterdam: Elsevier, 2005. Print.

3. Fabian, Richard. "Data-Oriented Design." Data-Oriented Design. N.p., 25 June 2013. Web. 03 Apr. 2014.

4. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2003. Print.

Thank you, any questions?
