Simulations of Memory Hierarchy LAB 2: CACHE LAB
Dec 17, 2015
OVERVIEW
• Objectives
• Cache Set-Up
• Command line parsing
• Least Recently Used (LRU)
• Matrix Transposition
• Cache-Friendly Code
OBJECTIVE
• There are two parts to this lab:
• Part A: Cache Simulator
• Simulate a cache table using the LRU algorithm
• Part B: Optimizing Matrix Transpose
• Write “cache-friendly” code that minimizes cache misses in a matrix transpose function
• When submitting your lab, please submit the handin.tar file as described in the instructions.
MEMORY HIERARCHY
• Pick your poison: smaller, faster, and costlier, or larger, slower, and cheaper
CACHE ADDRESSING
• X-bit memory addresses (in Part A, X <= 64 bits)
• Block offset: b bits
• Set index: s bits
• Tag bits: X – b – s
• Cache is a collection of S=2^s cache sets
• Cache set is a collection of E cache lines
• E is the associativity of the cache
• If E=1, the cache is called “direct-mapped”
• Each cache line stores a block of B=2^b bytes of data
ADDRESS ANATOMY
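The address layout is tag | set index | block offset, with the offset in the low b bits. A minimal C sketch of extracting the three fields with shifts and masks follows; the struct and function names are hypothetical, and it assumes s + b < 64 so that no shift reaches the full width of the address.

```c
#include <stdint.h>

/* Hypothetical container for the three address fields. */
typedef struct {
    uint64_t tag;
    uint64_t set;
    uint64_t offset;
} addr_parts;

/* Split addr into tag | set index | block offset.
 * Assumes s + b < 64 (shifting a 64-bit value by 64 is undefined in C). */
addr_parts split_address(uint64_t addr, int s, int b) {
    addr_parts p;
    uint64_t set_mask = (UINT64_C(1) << s) - 1;  /* low s bits set */
    uint64_t off_mask = (UINT64_C(1) << b) - 1;  /* low b bits set */
    p.offset = addr & off_mask;                  /* lowest b bits */
    p.set = (addr >> b) & set_mask;              /* next s bits */
    p.tag = addr >> (s + b);                     /* everything above */
    return p;
}
```

For example, with s = 4 and b = 4, address 0x1234 splits into offset 0x4, set 0x3, and tag 0x12.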
CACHE TABLE BASICS
• Parameters:
• Number of sets (S = 2^s)
• Block size in bytes (B = 2^b)
• Lines per set (E, the associativity)
• The total data capacity of the cache is therefore C = S*E*B bytes
• Blocks are the fundamental units of the cache
CACHE TABLE CORRESPONDENCE WITH ADDRESS
Example for a 32-bit address
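As a worked example of the address split and the capacity formula, take a 32-bit address with the illustrative (assumed, not from the original figure) parameters s = 5, b = 5, E = 1:

```latex
\begin{aligned}
t &= X - s - b = 32 - 5 - 5 = 22 \text{ tag bits} \\
S &= 2^{s} = 2^{5} = 32 \text{ sets} \\
B &= 2^{b} = 2^{5} = 32 \text{ bytes per block} \\
C &= S \times E \times B = 32 \times 1 \times 32 = 1024 \text{ bytes}
\end{aligned}
```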
CACHE SET LOOK-UP
• Determine the set index and the tag bits based on the memory address
• Locate the corresponding cache set and determine whether or not there exists a valid cache line with a matching tag
• If a cache miss occurs:
• If there is an empty (invalid) cache line, use it
• If the set is full, a cache line must be evicted
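The look-up steps can be sketched in C as two scans over one set: first for a valid line with a matching tag (a hit), then for an empty line to fill on a miss. The `line_t` type and function names are hypothetical; this sketch leaves the choice of victim for a full set to the replacement policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* One simulated cache line: only the valid bit and tag matter here. */
typedef struct {
    bool valid;
    uint64_t tag;
} line_t;

/* Hit check: index of a valid line whose tag matches, or -1 on a miss. */
int find_hit(const line_t *set, int E, uint64_t tag) {
    for (int i = 0; i < E; i++)
        if (set[i].valid && set[i].tag == tag)
            return i;
    return -1;
}

/* On a miss: index of the first empty (invalid) line, or -1 if the set
 * is full and a victim must be chosen by the replacement policy. */
int find_empty(const line_t *set, int E) {
    for (int i = 0; i < E; i++)
        if (!set[i].valid)
            return i;
    return -1;
}
```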
TYPES OF CACHE MISSES
• Compulsory Miss:
• The first access to a block must be a miss, since the block cannot already be in the cache
• Conflict Miss:
• The level-k cache is large enough, but multiple data objects all map to the same level-k block, so they keep evicting one another
• Capacity Miss:
• Occurs when the working set of blocks (blocks of memory being used) is larger than the cache
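To make the compulsory/conflict distinction concrete, here is a hedged C sketch of a tiny direct-mapped cache (E = 1) with hypothetical parameters: 2 sets of 16-byte blocks. Addresses 0x00 and 0x20 both map to set 0, so alternating between them misses every time even though set 1 is empty and the cache could hold both blocks. The first two misses are compulsory; the rest are conflict misses.

```c
#include <stdbool.h>
#include <stdint.h>

#define S_BITS 1               /* hypothetical: 2^1 = 2 sets */
#define B_BITS 4               /* hypothetical: 2^4 = 16-byte blocks */
#define NSETS (1 << S_BITS)

typedef struct { bool valid; uint64_t tag; } dm_line;

/* Direct-mapped access: returns true on a hit. On a miss, the single
 * line in the set is overwritten (no choice of victim when E = 1). */
bool dm_access(dm_line cache[NSETS], uint64_t addr) {
    uint64_t set = (addr >> B_BITS) & (NSETS - 1);
    uint64_t tag = addr >> (S_BITS + B_BITS);
    if (cache[set].valid && cache[set].tag == tag)
        return true;
    cache[set].valid = true;
    cache[set].tag = tag;
    return false;
}

/* Alternate between two addresses for `rounds` round trips
 * on an initially empty cache and count the misses. */
int count_pingpong_misses(uint64_t addr_a, uint64_t addr_b, int rounds) {
    dm_line cache[NSETS] = {{false, 0}};
    int misses = 0;
    for (int r = 0; r < rounds; r++) {
        if (!dm_access(cache, addr_a)) misses++;
        if (!dm_access(cache, addr_b)) misses++;
    }
    return misses;
}
```

With 4 rounds, `count_pingpong_misses(0x00, 0x20, 4)` yields 8 misses (every access), while `count_pingpong_misses(0x00, 0x10, 4)` yields only the 2 compulsory misses, because those two blocks land in different sets.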
PART A: CACHE SIMULATION
YOUR OWN CACHE SIMULATOR
• NOT a real cache
• Block offsets are NOT used by the simulator, but they are still important for understanding how a cache works
• s, b, and E are given at runtime
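Because s and E are only known at runtime, the set/line table has to be allocated dynamically. A minimal sketch, assuming a hypothetical `sim_line` type that holds only a valid bit, a tag, and an LRU timestamp (no data bytes, so b does not affect the allocation), stores all S × E lines in one contiguous array:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simulated line: no data block is stored. */
typedef struct {
    int valid;
    uint64_t tag;
    unsigned long last_used;  /* timestamp for LRU replacement */
} sim_line;

/* Allocate S = 2^s sets of E lines each. calloc() zero-fills, so every
 * line starts out invalid. Returns NULL if allocation fails. */
sim_line *alloc_cache(int s, int E) {
    size_t S = (size_t)1 << s;
    return calloc(S * (size_t)E, sizeof(sim_line));
}

/* Line j of set i lives at index i * E + j; return the start of set i. */
sim_line *set_ptr(sim_line *cache, uint64_t set_index, int E) {
    return cache + set_index * (size_t)E;
}
```

A single flat array keeps the bookkeeping to one `free()` call, at the cost of computing the set's starting index by hand.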
FUNCTIONS TO USE FOR COMMAND-LINE PARSING
• int getopt(int argc, char *const *argv, const char *options)
• See: http://www.gnu.org/software/libc/manual/html_node/Example-of-Getopt.html#Example-of-Getopt
• long long int strtoll(const char* str, char** endptr, int base)
• See: http://www.cplusplus.com/reference/cstdlib/strtoll/
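A hedged sketch of how `getopt()` and `strtoll()` fit together. The option letters `-s`, `-E`, `-b`, `-t` and the `sim_args`/`parse_args` names are assumptions for illustration; use whatever flags the lab handout actually specifies.

```c
#define _POSIX_C_SOURCE 200809L  /* expose getopt() and optarg under strict C */
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the parsed simulator parameters. */
typedef struct {
    int s, E, b;
    const char *trace;
} sim_args;

/* Decimal string -> non-negative int via strtoll(); rejects empty input,
 * trailing junk (endptr not at '\0'), and out-of-range values. */
static int parse_nonneg(const char *str, int *out) {
    char *end;
    long long v = strtoll(str, &end, 10);
    if (end == str || *end != '\0' || v < 0 || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Parse the assumed flags. Returns 0 on success, -1 on any error. */
int parse_args(int argc, char *argv[], sim_args *out) {
    int opt;
    out->s = out->E = out->b = -1;
    out->trace = NULL;
    /* A ':' after a letter means that option takes an argument (optarg). */
    while ((opt = getopt(argc, argv, "s:E:b:t:")) != -1) {
        switch (opt) {
        case 's': if (parse_nonneg(optarg, &out->s)) return -1; break;
        case 'E': if (parse_nonneg(optarg, &out->E)) return -1; break;
        case 'b': if (parse_nonneg(optarg, &out->b)) return -1; break;
        case 't': out->trace = optarg; break;
        default:  return -1;  /* '?': unknown option or missing argument */
        }
    }
    /* All four options are required, and a set needs at least one line. */
    if (out->s < 0 || out->E < 1 || out->b < 0 || out->trace == NULL)
        return -1;
    return 0;
}
```

Usage, under these assumed flags: `./csim -s 4 -E 1 -b 4 -t example.trace` fills in s = 4, E = 1, b = 4.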
LEAST RECENTLY USED (LRU) ALGORITHM
• Use the least recently used (LRU) policy to decide which cache line to evict from a full set
• Each cache line needs some sort of “time” field, which should be updated every time that cache line is referenced
• If a cache miss occurs in a full cache set, the cache line with the oldest (least recent) time field should be evicted
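One way to realize the “time” field is a global access counter that the caller increments once per memory reference. In this minimal sketch (the `lru_line` type, `access_result` enum, and `lru_access` function are hypothetical), a hit refreshes the line's timestamp, a miss fills an empty line if one exists, and otherwise the line with the smallest timestamp is evicted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool valid;
    uint64_t tag;
    unsigned long last_used;  /* clock value at the most recent reference */
} lru_line;

typedef enum { HIT, MISS, MISS_EVICT } access_result;

/* Access one set of E lines with LRU replacement.
 * `clock` must increase with every access the simulator performs. */
access_result lru_access(lru_line *set, int E, uint64_t tag, unsigned long clock) {
    /* 1. Hit: refresh the timestamp of the matching line. */
    for (int i = 0; i < E; i++) {
        if (set[i].valid && set[i].tag == tag) {
            set[i].last_used = clock;
            return HIT;
        }
    }
    /* 2. Miss: prefer an empty line. */
    int slot = -1;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid) { slot = i; break; }
    }
    access_result r = MISS;
    /* 3. Set full: evict the line with the oldest timestamp. */
    if (slot < 0) {
        slot = 0;
        for (int i = 1; i < E; i++)
            if (set[i].last_used < set[slot].last_used)
                slot = i;
        r = MISS_EVICT;
    }
    set[slot].valid = true;
    set[slot].tag = tag;
    set[slot].last_used = clock;
    return r;
}
```

For a 2-way set, the tag sequence 1, 2, 1, 3 gives miss, miss, hit, miss-with-eviction; tag 2 is evicted because the hit on tag 1 made tag 2 the least recently used line.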
PART B: OPTIMIZING MATRIX TRANSPOSE
WHAT IS A MATRIX TRANSPOSITION?
• The transpose of a matrix A is denoted as A^T
• The rows of A^T are the columns of A, and the columns of A^T are the rows of A
• Example:
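A small illustrative example (the values are arbitrary, not from the original slide): a 2 × 3 matrix becomes a 3 × 2 matrix, with entry (i, j) of the transpose taken from entry (j, i) of the original.

```latex
A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, \qquad
A^{T} = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}, \qquad
(A^{T})_{ij} = A_{ji}
```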
GENERAL MATRIX TRANSPOSITION
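The straightforward algorithm in C, as a baseline sketch (the function name is hypothetical; C99 variable-length array parameters let the dimensions be passed at runtime). A is read row by row, which has good spatial locality, but B is written column by column, so consecutive writes are a full row of B apart.

```c
/* Naive transpose of the N x M matrix A into the M x N matrix B.
 * Reads of A walk along a row; writes to B walk down a column. */
void naive_transpose(int M, int N, int A[N][M], int B[M][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            B[j][i] = A[i][j];
}
```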
CACHE-FRIENDLY CODE
• To reduce cache misses, make good use of:
• Temporal locality: reuse a cache block while it is still in the cache (avoid conflict misses [thrashing])
• Spatial locality: access data stored at nearby memory addresses
• Tips:
• Cache blocking
• Optimized access patterns
• Your code should look ugly if done correctly
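Spatial locality in a nutshell: C stores 2-D arrays in row-major order, so the two functions below compute the same sum but touch memory very differently. This is an illustrative sketch with hypothetical names and an arbitrary 256 × 256 size; no timings are claimed.

```c
#define ROWS 256
#define COLS 256

/* Row order: consecutive iterations read adjacent ints, so every
 * byte of a fetched cache block is used before moving on. */
long sum_row_order(int a[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column order: consecutive iterations are COLS * sizeof(int) bytes
 * apart, so each read may need a different cache block. */
long sum_col_order(int a[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```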
CACHE BLOCKING
• Partition the matrix in question into sub-matrices
• Divide the larger problem into smaller sub-problems
• Main idea:
• Perform the transpose one block at a time, rather than index by index, row by row as the simple algorithm does
• Choosing the block size will take some thought and experimentation
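A generic blocked (tiled) transpose, sketched in C. `BSIZE = 8` is only a placeholder, not a recommended value: the right size depends on the cache parameters and must be found by experiment. The idea is that the BSIZE short row segments of A and of B touched by one tile can ideally stay cached while that tile is processed. The bounds checks let M and N be non-multiples of BSIZE.

```c
#define BSIZE 8  /* placeholder tile size; tune it for the target cache */

/* Blocked transpose of the N x M matrix A into the M x N matrix B. */
void blocked_transpose(int M, int N, int A[N][M], int B[M][N]) {
    for (int ii = 0; ii < N; ii += BSIZE)          /* tile row start */
        for (int jj = 0; jj < M; jj += BSIZE)      /* tile column start */
            /* Transpose one tile, clamping at the matrix edges. */
            for (int i = ii; i < ii + BSIZE && i < N; i++)
                for (int j = jj; j < jj + BSIZE && j < M; j++)
                    B[j][i] = A[i][j];
}
```

This is the textbook form of blocking only; it does not address the diagonal or local-variable questions below.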
QUESTIONS TO PONDER
• What would happen if, instead of accessing each index in row order, you jumped from row to row within the same column?
• What would happen if you declared only 4 local variables as opposed to 12 local variables?
• Is it possible to get rid of the local variables altogether?
• What happens when accessing elements along the diagonal?
• What happens when the program is run in a different directory?